
CI: TribeIT#testClusterStateNodes fails with "can't update [discovery.zen.minimum_master_nodes]" #23695

Closed
cbuescher opened this issue Mar 22, 2017 · 8 comments
Assignees
Labels
:Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure >test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI

Comments

@cbuescher
Member

Build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-intake/974

There are several of these failures in the logs:

[node_sm0/AmCrGgAAQACdSFhqAAAAAA] failed to prepareCommit settings for [Updater for: {
  1>   "key" : "discovery.zen.minimum_master_nodes",
  1>   "properties" : [
  1>     "Dynamic",
  1>     "NodeScope"
  1>   ],
  1>   "is_group_setting" : false,
  1>   "default" : "-1"
  1> }]
  1> java.lang.IllegalArgumentException: illegal value can't update [discovery.zen.minimum_master_nodes] from [-1] to [3]
  1> 	at org.elasticsearch.common.settings.Setting$Updater.getValue(Setting.java:594) ~[main/:?]
  1> 	at org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.updater(AbstractScopedSettings.java:346) ~[main/:?]
  1> 	at org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:166) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:808) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:633) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) ~[main/:?]
  1> 	at 
@cbuescher cbuescher added :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure >test-failure Triaged test failures from CI >test Issues or PRs that are addressing/adding tests labels Mar 22, 2017
@cbuescher
Member Author

Couldn't reproduce locally, but here's the reproduction line in any case:

gradle :core:integTest -Dtests.seed=CB45553B395B1BE4 -Dtests.class=org.elasticsearch.tribe.TribeIT -Dtests.method="testClusterStateNodes" -Dtests.security.manager=true -Dtests.locale=es-UY -Dtests.timezone=Asia/Bangkok

@cbuescher
Member Author

@ywelsch can you take a look at this and re-assign if needed?

@ywelsch
Contributor

ywelsch commented Mar 22, 2017

This exhibits a real issue with the minimum_master_nodes dynamic setting update check (though I'm not sure yet why the test fails).

To reproduce the issue:

  • Start 2 nodes
  • Dynamically update cluster settings by setting minimum_master_nodes to 2
  • Start another node
  • When that third node joins the cluster, it will throw said exception because it will check if the number of master nodes in the previous cluster state is at least as high as the minimum_master_nodes setting in the new cluster state. The number of master nodes in the previous cluster state is 0 (cluster state before the node joins the cluster) and the setting is 2.
Caused by: java.lang.IllegalArgumentException: cannot set discovery.zen.minimum_master_nodes to more than the current master nodes count [0]
	at org.elasticsearch.discovery.zen.ZenDiscovery.lambda$new$2(ZenDiscovery.java:176) ~[main/:?]
	at org.elasticsearch.common.settings.Setting$Updater.getValue(Setting.java:592) ~[main/:?]
	... 11 more

What's bad about this is that the node will not apply any of the other cluster settings in that cluster state publishing round.
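The failing check can be sketched as follows. This is a simplified, hypothetical model for illustration — the class and method names are not Elasticsearch's actual ZenDiscovery API:

```java
// Simplified, hypothetical model of the minimum_master_nodes update check.
// The real check lives in ZenDiscovery; names here are illustrative only.
public class MinMasterNodesCheck {

    // Reject the new value if it exceeds the number of master-eligible
    // nodes in the *current* cluster state -- which, for a node that is
    // still joining, is an empty state with zero master nodes.
    public static void validate(int currentMasterNodes, int newMinMasterNodes) {
        if (newMinMasterNodes > currentMasterNodes) {
            throw new IllegalArgumentException(
                "cannot set discovery.zen.minimum_master_nodes to more than "
                + "the current master nodes count [" + currentMasterNodes + "]");
        }
    }

    public static boolean rejects(int currentMasterNodes, int newMinMasterNodes) {
        try {
            validate(currentMasterNodes, newMinMasterNodes);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A joining node sees 0 master nodes in its current state, so a
        // published minimum_master_nodes of 2 is rejected -- the bug above.
        System.out.println(rejects(0, 2)); // prints: true
        // An established cluster with 3 master nodes accepts the same value.
        System.out.println(rejects(3, 2)); // prints: false
    }
}
```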

/cc: @bleskes

@Tim-Brooks
Contributor

It looks like there was a failure related to this today, on the G1GC master build.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/2327/console

gradle :core:integTest -Dtests.seed=8C71214509D0E0F4 -Dtests.class=org.elasticsearch.tribe.TribeIT -Dtests.method="testClusterStateNodes" -Dtests.security.manager=true -Dtests.jvm.argline="-XX:-UseConcMarkSweepGC -XX:+UseG1GC" -Dtests.locale=es-PE -Dtests.timezone=Africa/Gaborone

bleskes added a commit to bleskes/elasticsearch that referenced this issue Apr 5, 2017
…s master

The purpose of this validation is to make sure that the master doesn't step down
due to a change in master nodes, which also means that there is no way to revert
an accidental change. Since we validate using the current cluster state (and
not the one the settings come from), we have to be careful and only
validate if the local node is already a master. Doing so all the time causes
subtle issues. For example, a node that joins a cluster has no nodes in its
current cluster state. When it receives a cluster state from the master with
a dynamic minimum master nodes setting in it, we must make sure we don't reject it.

Closes elastic#23695
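Under the same simplified model as above, the fix described in this commit amounts to guarding the check so that only a node that is already master validates the new value. Again, names are hypothetical, not the actual Elasticsearch code:

```java
// Hypothetical sketch of the fix: skip the master-count validation unless
// the local node is already the elected master, so a joining node never
// rejects an incoming cluster state that carries this setting.
public class GuardedMinMasterNodesCheck {

    public static void validate(boolean localNodeIsMaster,
                                int currentMasterNodes,
                                int newMinMasterNodes) {
        if (!localNodeIsMaster) {
            return; // not master: accept the published state as-is
        }
        if (newMinMasterNodes > currentMasterNodes) {
            throw new IllegalArgumentException(
                "cannot set discovery.zen.minimum_master_nodes to more than "
                + "the current master nodes count [" + currentMasterNodes + "]");
        }
    }

    public static boolean rejects(boolean localNodeIsMaster,
                                  int currentMasterNodes,
                                  int newMinMasterNodes) {
        try {
            validate(localNodeIsMaster, currentMasterNodes, newMinMasterNodes);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The joining node (not master, empty current state) now accepts it.
        System.out.println(rejects(false, 0, 2)); // prints: false
        // The master still guards against accidentally stepping itself down.
        System.out.println(rejects(true, 1, 2));  // prints: true
    }
}
```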
bleskes added a commit that referenced this issue Apr 5, 2017
… master (#23915)

The purpose of this validation is to make sure that the master doesn't step down
due to a change in master nodes, which also means that there is no way to revert
an accidental change. Since we validate using the current cluster state (and
not the one the settings come from), we have to be careful and only
validate if the local node is already a master. Doing so all the time causes
subtle issues. For example, a node that joins a cluster has no nodes in its
current cluster state. When it receives a cluster state from the master with
a dynamic minimum master nodes setting in it, we must make sure we don't reject it.

Closes #23695
@martijnvg
Member

This test failed again after @bleskes' fix was applied: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+java9-periodic/2051/console

The failure is different, but I want to make sure that it has nothing to do with the original failure or the fix itself before opening a new issue.

FAILURE 11.0s J0 | TribeIT.testClusterStateNodes <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: iterable over ["node_sm3", "node_sm3/xYJvJwAAQAC2eure_____w", "cluster_10", "cluster_11"] in any order
   >      but: No item matches: "node_sm3/xYJvJwAAQAC2eure_____w", "cluster_10", "cluster_11" in ["node_sm3"]
   > 	at __randomizedtesting.SeedInfo.seed([7A44EACBF5B6FDAE:4E005C96EDD99B29]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.tribe.TribeIT.lambda$assertNodes$28(TribeIT.java:648)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:715)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:689)
   > 	at org.elasticsearch.tribe.TribeIT.assertNodes(TribeIT.java:645)
   > 	at org.elasticsearch.tribe.TribeIT.testClusterStateNodes(TribeIT.java:484)
   > 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   > 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   > 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   > 	at java.base/java.lang.reflect.Method.invoke(Method.java:547)
   > 	at java.base/java.lang.Thread.run(Thread.java:844)
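For reference, the failing assertion effectively checks that the tribe node's cluster state contains exactly the expected node names, in any order. A plain-Java restatement with simplified names — TribeIT actually uses a Hamcrest matcher, so this is only an illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Plain-Java restatement of the "iterable over [...] in any order" check
// seen in the failure message; not TribeIT's actual Hamcrest-based code.
public class NodesAssertion {

    public static boolean containsInAnyOrder(List<String> expected, List<String> actual) {
        return expected.size() == actual.size()
            && new HashSet<>(expected).equals(new HashSet<>(actual));
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("node_sm3", "cluster_10", "cluster_11");
        // As in the failure above, most of the expected nodes are missing.
        List<String> actual = Arrays.asList("node_sm3");
        System.out.println(containsInAnyOrder(expected, actual)); // prints: false
    }
}
```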

@abeyad

abeyad commented Apr 13, 2017

Another failure today:

FAILURE 11.6s J1 | TribeIT.testClusterStateNodes <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: iterable over ["node_sm3/jBfHUQAAQACHJdt7AAAAAA", "cluster_12", "cluster_2d4", "cluster_2d3", "cluster_10", "cluster_11", "cluster_2d5", "cluster_2m1", "node_sm3", "cluster_2m0", "node_sm3/h1AqCAAAQACMK6UyAAAAAA", "cluster_2m2"] in any order
   >      but: No item matches: "cluster_12", "cluster_10", "cluster_11", "node_sm3/h1AqCAAAQACMK6UyAAAAAA" in ["node_sm3/jBfHUQAAQACHJdt7AAAAAA", "cluster_2d4", "cluster_2d3", "cluster_2d5", "cluster_2m1", "node_sm3", "cluster_2m0", "cluster_2m2"]
   > 	at __randomizedtesting.SeedInfo.seed([4ED3B7B6E28F9E27:7A9701EBFAE0F8A0]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.tribe.TribeIT.lambda$assertNodes$28(TribeIT.java:648)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:715)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:689)
   > 	at org.elasticsearch.tribe.TribeIT.assertNodes(TribeIT.java:645)
   > 	at org.elasticsearch.tribe.TribeIT.testClusterStateNodes(TribeIT.java:484)
   > 	at java.lang.Thread.run(Thread.java:745)
   > 	Suppressed: java.lang.AssertionError: 
   > Expected: iterable over ["node_sm3/jBfHUQAAQACHJdt7AAAAAA", "cluster_12", "cluster_2d4", "cluster_2d3", "cluster_10", "cluster_11", "cluster_2d5", "cluster_2m1", "node_sm3", "cluster_2m0", "node_sm3/h1AqCAAAQACMK6UyAAAAAA", "cluster_2m2"] in any order
   >      but: No item matches: "cluster_12", "cluster_10", "cluster_11", "node_sm3/h1AqCAAAQACMK6UyAAAAAA" in ["node_sm3/jBfHUQAAQACHJdt7AAAAAA", "cluster_2d4", "cluster_2d3", "cluster_2d5", "cluster_2m1", "node_sm3", "cluster_2m0", "cluster_2m2"]
   > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 		at org.elasticsearch.tribe.TribeIT.lambda$assertNodes$28(TribeIT.java:648)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:703)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: 

See https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian/839/consoleFull

@abeyad

abeyad commented Apr 13, 2017

Reopening as the test is still failing.
/cc @bleskes

@abeyad abeyad reopened this Apr 13, 2017
ywelsch added a commit that referenced this issue Apr 24, 2017
The tribe node was being shut down by the test while a publishing round (the one that adds the tribe node to a cluster) had not yet completed: the node itself
knows that it became part of the cluster, and the test shuts it down, but another node has not applied the cluster state yet. That node then hangs
while trying to connect to the node that is shutting down (connect_timeout is 30 seconds), delaying publishing for 30
seconds and subsequently tripping an assertion when another tribe instance wants to join.

Relates to #23695
@ywelsch
Contributor

ywelsch commented Apr 24, 2017

I've looked into this failure (which is unrelated to the first one that was fixed here). What I think is happening is that the tribe node node_sm0/LfM1LgAAQACUzI9-AAAAAA is being shut down by the test while a publishing round (the one that adds the node to cluster_2) has not yet completed: the node itself knows that it became part of the cluster, and the test shuts it down, but another node (cluster_2m1) has not applied the cluster state yet. That node then hangs while trying to connect to the node that is shutting down (connect_timeout is 30 seconds), delaying publishing for 30 seconds.
This trips an assertion because the test subsequently starts another tribe node that also wants to join the cluster, and that node has only 10 seconds to join before the assertion trips; the master, however, is blocked for 30 seconds completing the previous publishing round.
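The race can be illustrated with a scaled-down sketch of an assertBusy-style wait. This is illustrative only, not ESTestCase's implementation; 100ms and 300ms stand in for the 10s join budget and the 30s connect timeout:

```java
import java.util.function.BooleanSupplier;

// Scaled-down illustration of the race: the test polls for at most
// `timeoutMillis`, but the blocked publishing round only completes after
// the (longer) connect timeout, so the assertion trips first.
public class BusyWait {

    public static boolean waitFor(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(10);
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // "Publishing" completes after 300ms; the "test" waits only 100ms
        // (stand-ins for the 30s connect timeout vs. the 10s join budget).
        BooleanSupplier publishingDone = () -> System.currentTimeMillis() - start > 300;
        System.out.println(waitFor(publishingDone, 100)); // prints: false
    }
}
```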

Relevant log lines:

1> [2017-04-18T02:48:31,991][INFO ][o.e.c.s.ClusterService   ] [cluster_2m1] added {{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [master {cluster_2m2}{CzEgKG_WQy6cAFNA9B31vw}{HNBvClr0T0SNeetgGZJHOA}{127.0.0.1}{127.0.0.1:9540} committed version [52]])


1> [2017-04-18T02:49:01,990][WARN ][o.e.d.z.PublishClusterStateAction] [cluster_2m2] timed out waiting for all nodes to process published state [52] (timeout [30s], pending nodes: [{cluster_2m1}{EGHcShPYTaSMgSmJBv8-ig}{suB_qk87R1esfD5HnJEhhQ}{127.0.0.1}{127.0.0.1:9544}])
  1> [2017-04-18T02:49:01,992][WARN ][o.e.c.s.ClusterService   ] [cluster_2m2] cluster state update task [zen-disco-node-join[{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300}]] took [30s] above the warn threshold of 30s
  1> [2017-04-18T02:49:01,994][INFO ][o.e.c.s.ClusterService   ] [cluster_2m2] removed {{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-node-left({node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300}), reason(left)[{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300} left], zen-disco-node-failed({node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300}), reason(transport disconnected)[{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300} transport disconnected]


1> [2017-04-18T02:49:02,013][WARN ][o.e.c.NodeConnectionsService] [cluster_2m1] failed to connect to node {node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300} (tried [1] times)
  1> org.elasticsearch.transport.ConnectTransportException: [node_sm0/LfM1LgAAQACUzI9-AAAAAA][127.0.0.1:9300] general node connection failure
  1> 	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:471) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:311) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:298) ~[main/:?]
  1> 	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) ~[main/:?]
  1> 	at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:107) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
  1> 	at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]
  1> Caused by: java.lang.IllegalStateException: handshake failed with {node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300}
  1> 	at org.elasticsearch.transport.TransportService.handshake(TransportService.java:382) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.lambda$connectToNode$3(TransportService.java:313) ~[main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:454) ~[main/:?]
  1> 	... 9 more
  1> Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_sm0/LfM1LgAAQACUzI9-AAAAAA][127.0.0.1:9300][internal:transport/handshake] request_id [25] timed out after [30000ms]
  1> 	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:921) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[main/:?]
  1> 	... 3 more
  1> [2017-04-18T02:49:02,016][WARN ][o.e.c.s.ClusterService   ] [cluster_2m1] cluster state update task [zen-disco-receive(from master [master {cluster_2m2}{CzEgKG_WQy6cAFNA9B31vw}{HNBvClr0T0SNeetgGZJHOA}{127.0.0.1}{127.0.0.1:9540} committed version [52]])] took [30s] above the warn threshold of 30s
  1> [2017-04-18T02:49:02,016][INFO ][o.e.c.s.ClusterService   ] [cluster_2m1] removed {{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [master {cluster_2m2}{CzEgKG_WQy6cAFNA9B31vw}{HNBvClr0T0SNeetgGZJHOA}{127.0.0.1}{127.0.0.1:9540} committed version [53]])

I've pushed 7c39507 to address this.

@ywelsch ywelsch closed this as completed Apr 24, 2017