
Don't set local node on cluster state used for node join validation #23311

Conversation

@ywelsch ywelsch (Contributor) commented Feb 22, 2017

When a node wants to join a cluster, it sends a join request to the master. The master then sends a join validation request to the node. This checks that the node can deserialize the current cluster state that exists on the master and that it can thus handle all the indices that are currently in the cluster (see #21830).

The current code can trip an assertion because it does not take the cluster state as is; instead, the joining node is set as the local node on that cluster state. This can result in an inconsistent DiscoveryNodes object, as the local node is not yet part of the cluster state: a node with the same id but a different address can still exist in it, and another node with the same address but a different id can also exist if multiple nodes run on the same machine and ports were swapped after node crashes/restarts.
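To make the failure mode concrete, here is a minimal, self-contained sketch of the uniqueness invariant behind the assertion shown below. The class and method names are hypothetical and this is not the actual DiscoveryNodes implementation; it only illustrates why injecting the joining node as the local node clashes with a stale entry that still holds the same address.

```java
// Hypothetical, simplified sketch of the invariant behind the
// "found existing node ... with same address" assertion. Not Elasticsearch code.
import java.util.HashMap;
import java.util.Map;

final class DiscoNodesSketch {

    static final class Node {
        final String id;       // node id
        final String address;  // publish address, e.g. "127.0.0.1:9783"

        Node(String id, String address) {
            this.id = id;
            this.address = address;
        }

        @Override
        public String toString() {
            return "{" + id + "}{" + address + "}";
        }
    }

    private final Map<String, Node> byId = new HashMap<>();
    private final Map<String, Node> byAddress = new HashMap<>();

    // Adding a node fails if another entry already uses the same id or the same address.
    void add(Node node) {
        Node sameId = byId.get(node.id);
        if (sameId != null) {
            throw new AssertionError("can't add node " + node
                + ", found existing node " + sameId + " with the same id");
        }
        Node sameAddress = byAddress.get(node.address);
        if (sameAddress != null) {
            throw new AssertionError("can't add node " + node
                + ", found existing node " + sameAddress + " with same address");
        }
        byId.put(node.id, node);
        byAddress.put(node.address, node);
    }

    public static void main(String[] args) {
        DiscoNodesSketch nodes = new DiscoNodesSketch();
        // Stale entry left over from before a crash/restart and port swap:
        nodes.add(new Node("cP9AQejSTBWvJmopV5y4Mw", "127.0.0.1:9783"));
        // Injecting the joining node as the local node then violates the invariant:
        nodes.add(new Node("YTZdF3xkQQmT-vN8PRNOvA", "127.0.0.1:9783")); // throws AssertionError
    }
}
```

Sending the master's cluster state unmodified for the validation round trip avoids building such an inconsistent node set, since the joining node only becomes part of DiscoveryNodes once the join actually completes.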

Build failure:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu/643/consoleFull

From the logs:

at java.lang.Thread.run(Thread.java:745)
Throwable #3: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=7061, name=elasticsearch[node_t4][__mock_network_thread][T#6], state=RUNNABLE, group=TGRP-MinimumMasterNodesIT]
   > Caused by: java.lang.AssertionError: building disco nodes from network doesn't pass preflight: can't add node {node_t4}{YTZdF3xkQQmT-vN8PRNOvA}{K8X3Bf9fRb26y3FQWWmgkg}{127.0.0.1}{127.0.0.1:9783}, found existing node {node_t3}{cP9AQejSTBWvJmopV5y4Mw}{pT5FBfktS5-3OX1PiNEVvg}{127.0.0.1}{127.0.0.1:9783} with same address
   > 	at __randomizedtesting.SeedInfo.seed([840C6FA44DFAF8D3]:0)
   > 	at org.elasticsearch.cluster.node.DiscoveryNodes.readFrom(DiscoveryNodes.java:543)
   > 	at org.elasticsearch.cluster.ClusterState.readFrom(ClusterState.java:662)
   > 	at org.elasticsearch.discovery.zen.MembershipAction$ValidateJoinRequest.readFrom(MembershipAction.java:173)
   > 	at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1441)
   > 	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1328)
   > 	at org.elasticsearch.transport.MockTcpTransport.readMessage(MockTcpTransport.java:175)
   > 	at org.elasticsearch.transport.MockTcpTransport.access$1000(MockTcpTransport.java:72)
   > 	at org.elasticsearch.transport.MockTcpTransport$MockChannel$1.lambda$doRun$0(MockTcpTransport.java:359)
   > 	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105)
   > 	at org.elasticsearch.transport.MockTcpTransport$MockChannel$1.doRun(MockTcpTransport.java:359)
   > 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   > 	at java.lang.Thread.run(Thread.java:745)
Throwable #4: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=7057, name=elasticsearch[node_t5][__mock_network_thread][T#9], state=RUNNABLE, group=TGRP-MinimumMasterNodesIT]

@jasontedor jasontedor (Member) left a comment

LGTM.

@ywelsch ywelsch merged commit 0f88f21 into elastic:master Feb 22, 2017
ywelsch added a commit that referenced this pull request Feb 22, 2017
…23311)

ywelsch added a commit that referenced this pull request Feb 22, 2017
…23311)

ywelsch added a commit that referenced this pull request Feb 22, 2017
…23311)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 24, 2017
* master: (54 commits)
  Keep the pipeline handler queue small initially
  Do not create String instances in 'Strings' methods accepting StringBuilder (elastic#22907)
  Tests: fix AwsS3ServiceImplTests
  Remove abstract InternalMetricsAggregation class (elastic#23326)
  Add BulkRequest support to High Level Rest client (elastic#23312)
  Wrap getCredentials() in a doPrivileged() block (elastic#23297)
  Respect promises on pipelined responses
  Align REST specs for HEAD requests
  Remove unnecessary result sorting in SearchPhaseController (elastic#23321)
  Fix SamplerAggregatorTests to have stable and predictable docIds
  Tests: Ensure multi node integ tests wait on first node
  Relocate a comment in HttpPipeliningHandler
  Add comments to HttpPipeliningHandler
  [TEST] Fix incorrect test cluster name in cluster health doc tests
  Build: Change location in zip of license and notice inclusion for plugins (elastic#23316)
  Script: Fix value of `ctx._now` to be current epoch time in milliseconds (elastic#23175)
  Build: Rework integ test setup and shutdown to ensure stop runs when desired (elastic#23304)
  Handle long overflow when adding paths' totals
  Don't set local node on cluster state used for node join validation (elastic#23311)
  Ensure that releasing listener is called
  ...
@clintongormley clintongormley added :Distributed/Distributed and removed :Cluster labels Feb 13, 2018
Labels
>bug :Distributed/Distributed v5.2.2 v5.3.0 v5.4.0 v6.0.0-alpha1