Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure we don't use a remote profile if cluster name matches #31331

Merged
merged 6 commits into from
Jun 17, 2018

Conversation

s1monw
Copy link
Contributor

@s1monw s1monw commented Jun 14, 2018

If we are running into a race condition between a node being configured
to be a remote node for cross cluster search etc. and that node joining
the cluster we might connect to that node with a remote profile. If that
node now joins the cluster it connected to it as a CCS remote node we use
the wrong profile and can't use bulk connections etc. anymore. This change
uses the remote profile only if we connect to a node that has a different cluster
name than the local cluster. This is not a perfect fix for this situation but
is the safe option while potentially only loose a small optimization of using
less connections per node which is small anyways since we only connect to a
small set of nodes.

Closes #29321

If we are running into a race condition between a node being configured
to be a remote node for cross cluster search etc. and that node joining
the cluster we might connect to that node with a remote profile. If that
node now joins the cluster it connected to it as a CCS remote node we use
the wrong profile and can't use bulk connections etc. anymore. This change
uses the remote profile only if we connect to a node that has a different cluster
name than the local cluster. This is not a perfect fix for this situation but
is the safe option while potentially only loose a small optimization of using
less connections per node which is small anyways since we only connect to a
small set of nodes.

Closes elastic#29321
@s1monw s1monw added >bug :Distributed/Network Http and internode communication implementations v7.0.0 v6.4.0 v6.3.1 v5.6.11 labels Jun 14, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@s1monw s1monw requested a review from jasontedor June 14, 2018 13:13
private ConnectionProfile getRemoteProfile(ClusterName name) {
// we can only compare the cluster name to make a decision if we should use a remote profile
// we can't use a cluster UUID here since we could be connecting to that remote cluster before
// the remote node has joined it cluster and have a cluster UUID. The fact that we just loose a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/it cluster/its cluster also s/loose/lose

// we can only compare the cluster name to make a decision if we should use a remote profile
// we can't use a cluster UUID here since we could be connecting to that remote cluster before
// the remote node has joined it cluster and have a cluster UUID. The fact that we just loose a
// rather smallish optimization on the connection layer under certain situations were remote clusters
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/were/where

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with this @s1monw thanks, would it be possible to add a test, it seems like the issue that's getting fixed is quite nasty.

@javanna javanna dismissed their stale review June 14, 2018 14:33

hit wrong button

@s1monw
Copy link
Contributor Author

s1monw commented Jun 14, 2018

@javanna I pushed some tests I don't think I can integration test this in a sane way

Copy link
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I mainly have a question about connecting with a null connection profile vs. not connecting at all and letting the normal cluster formation process create the connection.

if (nodePredicate.test(handshakeNode) && connectedNodes.size() < maxNumRemoteConnections) {
transportService.connectToNode(handshakeNode, remoteProfile);
transportService.connectToNode(handshakeNode, getRemoteProfile(handshakeResponse.getClusterName()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my main question is, do we want to connect? I guess I don't see any functional difference. It just seems kind of weird that a "normal" cluster node connection is initiated from the RemoteClusterConnection and not the normal cluster formation work. And would be a catch in the future if we ever made a change to how normal cluster connections work.

Is there maybe a different race condition concern if we do not initiate the connection here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well I think if that cluster has coincidently the same name we would never instantiate this connection? If we don't connect, when would we connect to it? I can think of way more complicated things like waiting for the remote cluster to be formed etc. and then compare cluster UUIDs but that would require the local cluster to be formed and the remote one which is a complicating many many things. I think we can investigate this going forward ie. to only start connections once we have formed a cluster and never establish a connection to a node that isn't part of a cluster but this would be too complex to backport and this is a bug I want to be fixed in 5.6.x. I will create a followup issue for this and look into it what it takes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right. That makes sense.

@s1monw s1monw merged commit 3d5f113 into elastic:master Jun 17, 2018
@s1monw s1monw deleted the issues/29321 branch June 17, 2018 11:33
s1monw added a commit that referenced this pull request Jun 17, 2018
If we are running into a race condition between a node being configured
to be a remote node for cross cluster search etc. and that node joining
the cluster we might connect to that node with a remote profile. If that
node now joins the cluster it connected to it as a CCS remote node we use
the wrong profile and can't use bulk connections etc. anymore. This change
uses the remote profile only if we connect to a node that has a different cluster
name than the local cluster. This is not a perfect fix for this situation but
is the safe option while potentially only loose a small optimization of using
less connections per node which is small anyways since we only connect to a
small set of nodes.

Closes #29321
s1monw added a commit that referenced this pull request Jun 17, 2018
If we are running into a race condition between a node being configured
to be a remote node for cross cluster search etc. and that node joining
the cluster we might connect to that node with a remote profile. If that
node now joins the cluster it connected to it as a CCS remote node we use
the wrong profile and can't use bulk connections etc. anymore. This change
uses the remote profile only if we connect to a node that has a different cluster
name than the local cluster. This is not a perfect fix for this situation but
is the safe option while potentially only loose a small optimization of using
less connections per node which is small anyways since we only connect to a
small set of nodes.

Closes #29321
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Jun 17, 2018
…#31331)

If we are running into a race condition between a node being configured
to be a remote node for cross cluster search etc. and that node joining
the cluster we might connect to that node with a remote profile. If that
node now joins the cluster it connected to it as a CCS remote node we use
the wrong profile and can't use bulk connections etc. anymore. This change
uses the remote profile only if we connect to a node that has a different cluster
name than the local cluster. This is not a perfect fix for this situation but
is the safe option while potentially only loose a small optimization of using
less connections per node which is small anyways since we only connect to a
small set of nodes.

Closes elastic#29321
s1monw added a commit that referenced this pull request Jun 17, 2018
If we are running into a race condition between a node being configured
to be a remote node for cross cluster search etc. and that node joining
the cluster we might connect to that node with a remote profile. If that
node now joins the cluster it connected to it as a CCS remote node we use
the wrong profile and can't use bulk connections etc. anymore. This change
uses the remote profile only if we connect to a node that has a different cluster
name than the local cluster. This is not a perfect fix for this situation but
is the safe option while potentially only loose a small optimization of using
less connections per node which is small anyways since we only connect to a
small set of nodes.

Closes #29321
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Jun 19, 2018
This change prevent remote cluster connections to be established
to nodes that have not yet joined a cluster and don't have a cluster
UUID. This allows to effectivly detect nodes that are part of the local
cluster. To compare the local cluster UUID to the remote nodes cluster UUID
we need to wait until we recovered a state and a master is elected before
we can connect to remote clusters.

Relates to elastic#31331
dnhatn added a commit that referenced this pull request Jun 19, 2018
* 6.x:
  Add get stored script and delete stored script to high level REST API
  Increasing skip version for failing test on 6.x
  Skip get_alias tests for 5.x (#31397)
  Fix defaults in GeoShapeFieldMapper output (#31302)
  Test: better error message on failure
  Mute DefaultShardsIT#testDefaultShards test
  Fix reference to XContentBuilder.string() (#31337)
  [DOCS] Adds monitoring breaking change (#31369)
  [DOCS] Adds security breaking change (#31375)
  [DOCS] Backports breaking change (#31373)
  RestAPI: Reject forcemerge requests with a body (#30792)
  Docs: Use the default distribution to test docs (#31251)
  Use system context for cluster state update tasks (#31241)
  [DOCS] Adds testing for security APIs (#31345)
  [DOCS] Removes ML item from release highlights
  [DOCS] Removes breaking change (#31376)
  REST high-level client: add validate query API (#31077)
  Move language analyzers from server to analysis-common module. (#31300)
  Expose lucene's RemoveDuplicatesTokenFilter (#31275)
  [Test] Fix :example-plugins:rest-handler on Windows
  Delete typos in SAML docs (#31199)
  Ensure we don't use a remote profile if cluster name matches (#31331)
  Test: Skip alias tests that failed all weekend
  [DOCS] Fix version in SQL JDBC Maven template
  [DOCS] Improve install and setup section for SQL JDBC
  Add ingest-attachment support for per document `indexed_chars` limit (#31352)
  SQL: Fix rest endpoint names in node stats (#31371)
  [DOCS] Fixes small issue in release notes
  Support for remote path in reindex api Closes #22913
  [ML] Put ML filter API response should contain the filter (#31362)
  Remove trial status info from start trial doc (#31365)
  [DOCS] Added links in breaking changes pages
  [DOCS] Adds links to release notes and highlights
  Docs: Document changes in rest client
  QA: Fix tribe tests to use node selector
  REST Client: NodeSelector for node attributes (#31296)
  LLClient: Fix assertion on windows
  LLClient: Support host selection (#30523)
  Add QA project and fixture based test for discovery-ec2 plugin (#31107)
  [ML] Hold ML filter items in sorted set (#31338)
  [ML] Add description to ML filters (#31330)
dnhatn added a commit that referenced this pull request Jun 19, 2018
* master:
  Add get stored script and delete stored script to high level REST API - post backport fix
  Add get stored script and delete stored script to high level REST API (#31355)
  Core: Combine Action and GenericAction (#31405)
  Fix reference to XContentBuilder.string() (#31337)
  Avoid sending duplicate remote failed shard requests (#31313)
  Fix defaults in GeoShapeFieldMapper output (#31302)
  RestAPI: Reject forcemerge requests with a body (#30792)
  Packaging: Remove windows bin files from the tar distribution (#30596)
  Docs: Use the default distribution to test docs (#31251)
  [DOCS] Adds testing for security APIs (#31345)
  Clarify that IP range data can be specified in CIDR notation. (#31374)
  Use system context for cluster state update tasks (#31241)
  Percentile/Ranks should return null instead of NaN when empty (#30460)
  REST high-level client: add validate query API (#31077)
  Move language analyzers from server to analysis-common module. (#31300)
  [Test] Fix :example-plugins:rest-handler on Windows
  Expose lucene's RemoveDuplicatesTokenFilter (#31275)
  Reload secure settings for plugins (#31383)
  Remove some cases in FieldTypeLookupTests that are no longer relevant. (#31381)
  Ensure we don't use a remote profile if cluster name matches (#31331)
  [TEST] Double write alias fault (#30942)
  [DOCS] Fix version in SQL JDBC Maven template
  [DOCS] Improve install and setup section for SQL JDBC
  SQL: Fix rest endpoint names in node stats (#31371)
  Support for remote path in reindex api - post backport fix Closes #22913
  [ML] Put ML filter API response should contain the filter (#31362)
  Support for remote path in reindex api (#31290)
  Add byte array pooling to nio http transport (#31349)
  Remove trial status info from start trial doc (#31365)
  [DOCS] Adds links to release notes and highlights
  add is-write-index flag to aliases (#30942)
  Add rollover-creation-date setting to rolled over index (#31144)
  [ML] Hold ML filter items in sorted set (#31338)
  [Tests] Fix edge case in ScriptedMetricAggregatorTests (#31357)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cross-cluster search and default connections can get crossed
5 participants