
Don't set local node on cluster state used for node join validation #23311

Conversation

@ywelsch ywelsch (Contributor) commented Feb 22, 2017

When a node wants to join a cluster, it sends a join request to the master. The master then sends a join validation request to the node. This checks that the node can deserialize the current cluster state that exists on the master and that it can thus handle all the indices that are currently in the cluster (see #21830).

The current code can trip an assertion because it does not take the cluster state as is; instead, the joining node is set as the local node on that cluster state. This can result in an inconsistent DiscoveryNodes object, as the local node is not yet part of the cluster state: a node with the same id but a different address can still exist in it, and another node with the same address but a different id can also exist if multiple nodes run on the same machine and ports were swapped after node crashes/restarts.
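To make the failure mode concrete, here is a minimal, self-contained sketch of the uniqueness invariant behind the assertion shown below. The class and method names are hypothetical and this is not the actual DiscoveryNodes implementation; it only illustrates why injecting the joining node as the local node clashes with a stale entry that still holds the same address.

```java
// Hypothetical, simplified sketch of the invariant behind the
// "found existing node ... with same address" assertion. Not Elasticsearch code.
import java.util.HashMap;
import java.util.Map;

final class DiscoNodesSketch {

    static final class Node {
        final String id;       // node id
        final String address;  // publish address, e.g. "127.0.0.1:9783"

        Node(String id, String address) {
            this.id = id;
            this.address = address;
        }

        @Override
        public String toString() {
            return "{" + id + "}{" + address + "}";
        }
    }

    private final Map<String, Node> byId = new HashMap<>();
    private final Map<String, Node> byAddress = new HashMap<>();

    // Adding a node fails if another entry already uses the same id or the same address.
    void add(Node node) {
        Node sameId = byId.get(node.id);
        if (sameId != null) {
            throw new AssertionError("can't add node " + node
                + ", found existing node " + sameId + " with the same id");
        }
        Node sameAddress = byAddress.get(node.address);
        if (sameAddress != null) {
            throw new AssertionError("can't add node " + node
                + ", found existing node " + sameAddress + " with same address");
        }
        byId.put(node.id, node);
        byAddress.put(node.address, node);
    }

    public static void main(String[] args) {
        DiscoNodesSketch nodes = new DiscoNodesSketch();
        // Stale entry left over from before a crash/restart and port swap:
        nodes.add(new Node("cP9AQejSTBWvJmopV5y4Mw", "127.0.0.1:9783"));
        // Injecting the joining node as the local node then violates the invariant:
        nodes.add(new Node("YTZdF3xkQQmT-vN8PRNOvA", "127.0.0.1:9783")); // throws AssertionError
    }
}
```

Sending the master's cluster state unmodified for the validation round trip avoids building such an inconsistent node set, since the joining node only becomes part of DiscoveryNodes once the join actually completes.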

Build failure:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu/643/consoleFull

From the logs:

at java.lang.Thread.run(Thread.java:745)
Throwable #3: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=7061, name=elasticsearch[node_t4][__mock_network_thread][T#6], state=RUNNABLE, group=TGRP-MinimumMasterNodesIT]
   > Caused by: java.lang.AssertionError: building disco nodes from network doesn't pass preflight: can't add node {node_t4}{YTZdF3xkQQmT-vN8PRNOvA}{K8X3Bf9fRb26y3FQWWmgkg}{127.0.0.1}{127.0.0.1:9783}, found existing node {node_t3}{cP9AQejSTBWvJmopV5y4Mw}{pT5FBfktS5-3OX1PiNEVvg}{127.0.0.1}{127.0.0.1:9783} with same address
   > 	at __randomizedtesting.SeedInfo.seed([840C6FA44DFAF8D3]:0)
   > 	at org.elasticsearch.cluster.node.DiscoveryNodes.readFrom(DiscoveryNodes.java:543)
   > 	at org.elasticsearch.cluster.ClusterState.readFrom(ClusterState.java:662)
   > 	at org.elasticsearch.discovery.zen.MembershipAction$ValidateJoinRequest.readFrom(MembershipAction.java:173)
   > 	at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1441)
   > 	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1328)
   > 	at org.elasticsearch.transport.MockTcpTransport.readMessage(MockTcpTransport.java:175)
   > 	at org.elasticsearch.transport.MockTcpTransport.access$1000(MockTcpTransport.java:72)
   > 	at org.elasticsearch.transport.MockTcpTransport$MockChannel$1.lambda$doRun$0(MockTcpTransport.java:359)
   > 	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105)
   > 	at org.elasticsearch.transport.MockTcpTransport$MockChannel$1.doRun(MockTcpTransport.java:359)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   > 	at java.lang.Thread.run(Thread.java:745)
Throwable #4: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=7057, name=elasticsearch[node_t5][__mock_network_thread][T#9], state=RUNNABLE, group=TGRP-MinimumMasterNodesIT]

@jasontedor jasontedor (Member) left a comment

LGTM.

@ywelsch ywelsch merged commit 0f88f21 into elastic:master Feb 22, 2017
ywelsch added a commit that referenced this pull request Feb 22, 2017
…23311)

ywelsch added a commit that referenced this pull request Feb 22, 2017
…23311)

ywelsch added a commit that referenced this pull request Feb 22, 2017
…23311)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 24, 2017
* master: (54 commits)
  Keep the pipeline handler queue small initially
  Do not create String instances in 'Strings' methods accepting StringBuilder (elastic#22907)
  Tests: fix AwsS3ServiceImplTests
  Remove abstract InternalMetricsAggregation class (elastic#23326)
  Add BulkRequest support to High Level Rest client (elastic#23312)
  Wrap getCredentials() in a doPrivileged() block (elastic#23297)
  Respect promises on pipelined responses
  Align REST specs for HEAD requests
  Remove unnecessary result sorting in SearchPhaseController (elastic#23321)
  Fix SamplerAggregatorTests to have stable and predictable docIds
  Tests: Ensure multi node integ tests wait on first node
  Relocate a comment in HttpPipeliningHandler
  Add comments to HttpPipeliningHandler
  [TEST] Fix incorrect test cluster name in cluster health doc tests
  Build: Change location in zip of license and notice inclusion for plugins (elastic#23316)
  Script: Fix value of `ctx._now` to be current epoch time in milliseconds (elastic#23175)
  Build: Rework integ test setup and shutdown to ensure stop runs when desired (elastic#23304)
  Handle long overflow when adding paths' totals
  Don't set local node on cluster state used for node join validation (elastic#23311)
  Ensure that releasing listener is called
  ...
@clintongormley clintongormley added :Distributed/Distributed and removed :Cluster labels Feb 13, 2018
Labels
>bug :Distributed/Distributed v5.2.2 v5.3.0 v5.4.0 v6.0.0-alpha1