
CI: TribeIT#testClusterStateNodes fails with "can't update [discovery.zen.minimum_master_nodes]" #23695

Closed
cbuescher opened this issue Mar 22, 2017 · 8 comments
Assignees
Labels
:Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure >test Issues or PRs that are addressing/adding tests >test-failure Triaged test failures from CI

Comments

@cbuescher
Member

Build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-intake/974

There are several of these failures in the logs:

[node_sm0/AmCrGgAAQACdSFhqAAAAAA] failed to prepareCommit settings for [Updater for: {
  1>   "key" : "discovery.zen.minimum_master_nodes",
  1>   "properties" : [
  1>     "Dynamic",
  1>     "NodeScope"
  1>   ],
  1>   "is_group_setting" : false,
  1>   "default" : "-1"
  1> }]
  1> java.lang.IllegalArgumentException: illegal value can't update [discovery.zen.minimum_master_nodes] from [-1] to [3]
  1> 	at org.elasticsearch.common.settings.Setting$Updater.getValue(Setting.java:594) ~[main/:?]
  1> 	at org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.updater(AbstractScopedSettings.java:346) ~[main/:?]
  1> 	at org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:166) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:808) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:633) ~[main/:?]
  1> 	at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) ~[main/:?]
  1> 	at 
@cbuescher cbuescher added :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure >test-failure Triaged test failures from CI >test Issues or PRs that are addressing/adding tests labels Mar 22, 2017
@cbuescher
Member Author

Couldn't reproduce locally, but here's the reproduction line in any case:

gradle :core:integTest -Dtests.seed=CB45553B395B1BE4 -Dtests.class=org.elasticsearch.tribe.TribeIT -Dtests.method="testClusterStateNodes" -Dtests.security.manager=true -Dtests.locale=es-UY -Dtests.timezone=Asia/Bangkok

@cbuescher
Member Author

@ywelsch can you take a look at this and re-assign if needed?

@ywelsch
Contributor

ywelsch commented Mar 22, 2017

This exhibits a real issue with the minimum_master_nodes dynamic setting update check (though I'm not sure yet why the test fails).

To reproduce the issue:

  • Start 2 nodes
  • Dynamically update cluster settings by setting minimum_master_nodes to 2
  • Start another node
  • When that third node joins the cluster, it will throw said exception because it will check if the number of master nodes in the previous cluster state is at least as high as the minimum_master_nodes setting in the new cluster state. The number of master nodes in the previous cluster state is 0 (cluster state before the node joins the cluster) and the setting is 2.
Caused by: java.lang.IllegalArgumentException: cannot set discovery.zen.minimum_master_nodes to more than the current master nodes count [0]
	at org.elasticsearch.discovery.zen.ZenDiscovery.lambda$new$2(ZenDiscovery.java:176) ~[main/:?]
	at org.elasticsearch.common.settings.Setting$Updater.getValue(Setting.java:592) ~[main/:?]
	... 11 more

What's bad about this is that the node will not apply any of the other cluster settings in that cluster state publishing round.
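The failing check can be sketched as follows. This is a simplified, hypothetical model for illustration — the class and method names are not Elasticsearch's actual ZenDiscovery API:

```java
// Simplified, hypothetical model of the minimum_master_nodes update check.
// The real check lives in ZenDiscovery; names here are illustrative only.
public class MinMasterNodesCheck {

    // Reject the new value if it exceeds the number of master-eligible
    // nodes in the *current* cluster state -- which, for a node that is
    // still joining, is an empty state with zero master nodes.
    public static void validate(int currentMasterNodes, int newMinMasterNodes) {
        if (newMinMasterNodes > currentMasterNodes) {
            throw new IllegalArgumentException(
                "cannot set discovery.zen.minimum_master_nodes to more than "
                + "the current master nodes count [" + currentMasterNodes + "]");
        }
    }

    public static boolean rejects(int currentMasterNodes, int newMinMasterNodes) {
        try {
            validate(currentMasterNodes, newMinMasterNodes);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A joining node sees 0 master nodes in its current state, so a
        // published minimum_master_nodes of 2 is rejected -- the bug above.
        System.out.println(rejects(0, 2)); // prints: true
        // An established cluster with 3 master nodes accepts the same value.
        System.out.println(rejects(3, 2)); // prints: false
    }
}
```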

/cc: @bleskes

@Tim-Brooks
Contributor

It looks like there was a failure related to this today, on the G1GC master build.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/2327/console

gradle :core:integTest -Dtests.seed=8C71214509D0E0F4 -Dtests.class=org.elasticsearch.tribe.TribeIT -Dtests.method="testClusterStateNodes" -Dtests.security.manager=true -Dtests.jvm.argline="-XX:-UseConcMarkSweepGC -XX:+UseG1GC" -Dtests.locale=es-PE -Dtests.timezone=Africa/Gaborone

bleskes added a commit to bleskes/elasticsearch that referenced this issue Apr 5, 2017
…s master

The purpose of this validation is to make sure that the master doesn't step down
due to a change in master nodes, which also means that there is no way to revert
an accidental change. Since we validate using the current cluster state (and
not the one the settings come from), we have to be careful and only
validate if the local node is already a master. Doing so all the time causes
subtle issues. For example, a node that joins a cluster has no nodes in its
current cluster state. When it receives a cluster state from the master with
a dynamic minimum master nodes setting in it, we must make sure we don't reject it.

Closes elastic#23695
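Under the same simplified model as above, the fix described in this commit amounts to guarding the check so that only a node that is already master validates the new value. Again, names are hypothetical, not the actual Elasticsearch code:

```java
// Hypothetical sketch of the fix: skip the master-count validation unless
// the local node is already the elected master, so a joining node never
// rejects an incoming cluster state that carries this setting.
public class GuardedMinMasterNodesCheck {

    public static void validate(boolean localNodeIsMaster,
                                int currentMasterNodes,
                                int newMinMasterNodes) {
        if (!localNodeIsMaster) {
            return; // not master: accept the published state as-is
        }
        if (newMinMasterNodes > currentMasterNodes) {
            throw new IllegalArgumentException(
                "cannot set discovery.zen.minimum_master_nodes to more than "
                + "the current master nodes count [" + currentMasterNodes + "]");
        }
    }

    public static boolean rejects(boolean localNodeIsMaster,
                                  int currentMasterNodes,
                                  int newMinMasterNodes) {
        try {
            validate(localNodeIsMaster, currentMasterNodes, newMinMasterNodes);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The joining node (not master, empty current state) now accepts it.
        System.out.println(rejects(false, 0, 2)); // prints: false
        // The master still guards against accidentally stepping itself down.
        System.out.println(rejects(true, 1, 2));  // prints: true
    }
}
```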
bleskes added a commit that referenced this issue Apr 5, 2017
… master (#23915)

The purpose of this validation is to make sure that the master doesn't step down
due to a change in master nodes, which also means that there is no way to revert
an accidental change. Since we validate using the current cluster state (and
not the one the settings come from), we have to be careful and only
validate if the local node is already a master. Doing so all the time causes
subtle issues. For example, a node that joins a cluster has no nodes in its
current cluster state. When it receives a cluster state from the master with
a dynamic minimum master nodes setting in it, we must make sure we don't reject it.

Closes #23695
@martijnvg
Member

This test failed again after @bleskes' fix was applied: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+java9-periodic/2051/console

The failure is different, but I want to make sure that it has nothing to do with the original failure or the fix itself before opening a new issue.

FAILURE 11.0s J0 | TribeIT.testClusterStateNodes <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: iterable over ["node_sm3", "node_sm3/xYJvJwAAQAC2eure_____w", "cluster_10", "cluster_11"] in any order
   >      but: No item matches: "node_sm3/xYJvJwAAQAC2eure_____w", "cluster_10", "cluster_11" in ["node_sm3"]
   > 	at __randomizedtesting.SeedInfo.seed([7A44EACBF5B6FDAE:4E005C96EDD99B29]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.tribe.TribeIT.lambda$assertNodes$28(TribeIT.java:648)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:715)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:689)
   > 	at org.elasticsearch.tribe.TribeIT.assertNodes(TribeIT.java:645)
   > 	at org.elasticsearch.tribe.TribeIT.testClusterStateNodes(TribeIT.java:484)
   > 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   > 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   > 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   > 	at java.base/java.lang.reflect.Method.invoke(Method.java:547)
   > 	at java.base/java.lang.Thread.run(Thread.java:844)
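For reference, the failing assertion effectively checks that the tribe node's cluster state contains exactly the expected node names, in any order. A plain-Java restatement with simplified names — TribeIT actually uses a Hamcrest matcher, so this is only an illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Plain-Java restatement of the "iterable over [...] in any order" check
// seen in the failure message; not TribeIT's actual Hamcrest-based code.
public class NodesAssertion {

    public static boolean containsInAnyOrder(List<String> expected, List<String> actual) {
        return expected.size() == actual.size()
            && new HashSet<>(expected).equals(new HashSet<>(actual));
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("node_sm3", "cluster_10", "cluster_11");
        // As in the failure above, most of the expected nodes are missing.
        List<String> actual = Arrays.asList("node_sm3");
        System.out.println(containsInAnyOrder(expected, actual)); // prints: false
    }
}
```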

@abeyad

abeyad commented Apr 13, 2017

Another failure today:

FAILURE 11.6s J1 | TribeIT.testClusterStateNodes <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: iterable over ["node_sm3/jBfHUQAAQACHJdt7AAAAAA", "cluster_12", "cluster_2d4", "cluster_2d3", "cluster_10", "cluster_11", "cluster_2d5", "cluster_2m1", "node_sm3", "cluster_2m0", "node_sm3/h1AqCAAAQACMK6UyAAAAAA", "cluster_2m2"] in any order
   >      but: No item matches: "cluster_12", "cluster_10", "cluster_11", "node_sm3/h1AqCAAAQACMK6UyAAAAAA" in ["node_sm3/jBfHUQAAQACHJdt7AAAAAA", "cluster_2d4", "cluster_2d3", "cluster_2d5", "cluster_2m1", "node_sm3", "cluster_2m0", "cluster_2m2"]
   > 	at __randomizedtesting.SeedInfo.seed([4ED3B7B6E28F9E27:7A9701EBFAE0F8A0]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.tribe.TribeIT.lambda$assertNodes$28(TribeIT.java:648)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:715)
   > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:689)
   > 	at org.elasticsearch.tribe.TribeIT.assertNodes(TribeIT.java:645)
   > 	at org.elasticsearch.tribe.TribeIT.testClusterStateNodes(TribeIT.java:484)
   > 	at java.lang.Thread.run(Thread.java:745)
   > 	Suppressed: java.lang.AssertionError: 
   > Expected: iterable over ["node_sm3/jBfHUQAAQACHJdt7AAAAAA", "cluster_12", "cluster_2d4", "cluster_2d3", "cluster_10", "cluster_11", "cluster_2d5", "cluster_2m1", "node_sm3", "cluster_2m0", "node_sm3/h1AqCAAAQACMK6UyAAAAAA", "cluster_2m2"] in any order
   >      but: No item matches: "cluster_12", "cluster_10", "cluster_11", "node_sm3/h1AqCAAAQACMK6UyAAAAAA" in ["node_sm3/jBfHUQAAQACHJdt7AAAAAA", "cluster_2d4", "cluster_2d3", "cluster_2d5", "cluster_2m1", "node_sm3", "cluster_2m0", "cluster_2m2"]
   > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 		at org.elasticsearch.tribe.TribeIT.lambda$assertNodes$28(TribeIT.java:648)
   > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:703)
   > 		... 40 more
   > 	Suppressed: java.lang.AssertionError: 

See https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian/839/consoleFull

@abeyad

abeyad commented Apr 13, 2017

Reopening as the test is still failing.
/cc @bleskes

@abeyad abeyad reopened this Apr 13, 2017
ywelsch added a commit that referenced this issue Apr 24, 2017
The tribe node was being shut down by the test while a publishing round (the one that adds the tribe node to a cluster) had not yet completed: the node itself
knows that it became part of the cluster, and the test shuts it down, but another node has not applied the cluster state yet. That node then hangs
while trying to connect to the node that is shutting down (connect_timeout is 30 seconds), delaying publishing for 30
seconds and subsequently tripping an assertion when another tribe instance wants to join.

Relates to #23695
@ywelsch
Contributor

ywelsch commented Apr 24, 2017

I've looked into this failure (which is unrelated to the first one that was fixed here). What I think is happening is that the tribe node node_sm0/LfM1LgAAQACUzI9-AAAAAA is being shut down by the test while a publishing round (the one that adds the node to cluster_2) has not yet completed: the node itself knows that it became part of the cluster, and the test shuts it down, but another node (cluster_2m1) has not applied the cluster state yet. That node then hangs while trying to connect to the node that is shutting down (connect_timeout is 30 seconds), delaying publishing for 30 seconds.
This trips an assertion because the test subsequently starts another tribe node that also wants to join the cluster, and that node has only 10 seconds to join before the assertion trips; the master, however, is blocked for 30 seconds completing the previous publishing round.
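The race can be illustrated with a scaled-down sketch of an assertBusy-style wait. This is illustrative only, not ESTestCase's implementation; 100ms and 300ms stand in for the 10s join budget and the 30s connect timeout:

```java
import java.util.function.BooleanSupplier;

// Scaled-down illustration of the race: the test polls for at most
// `timeoutMillis`, but the blocked publishing round only completes after
// the (longer) connect timeout, so the assertion trips first.
public class BusyWait {

    public static boolean waitFor(BooleanSupplier condition, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(10);
        }
        return condition.getAsBoolean(); // one last check at the deadline
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // "Publishing" completes after 300ms; the "test" waits only 100ms
        // (stand-ins for the 30s connect timeout vs. the 10s join budget).
        BooleanSupplier publishingDone = () -> System.currentTimeMillis() - start > 300;
        System.out.println(waitFor(publishingDone, 100)); // prints: false
    }
}
```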

Relevant log lines:

1> [2017-04-18T02:48:31,991][INFO ][o.e.c.s.ClusterService   ] [cluster_2m1] added {{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [master {cluster_2m2}{CzEgKG_WQy6cAFNA9B31vw}{HNBvClr0T0SNeetgGZJHOA}{127.0.0.1}{127.0.0.1:9540} committed version [52]])


1> [2017-04-18T02:49:01,990][WARN ][o.e.d.z.PublishClusterStateAction] [cluster_2m2] timed out waiting for all nodes to process published state [52] (timeout [30s], pending nodes: [{cluster_2m1}{EGHcShPYTaSMgSmJBv8-ig}{suB_qk87R1esfD5HnJEhhQ}{127.0.0.1}{127.0.0.1:9544}])
  1> [2017-04-18T02:49:01,992][WARN ][o.e.c.s.ClusterService   ] [cluster_2m2] cluster state update task [zen-disco-node-join[{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300}]] took [30s] above the warn threshold of 30s
  1> [2017-04-18T02:49:01,994][INFO ][o.e.c.s.ClusterService   ] [cluster_2m2] removed {{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-node-left({node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300}), reason(left)[{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300} left], zen-disco-node-failed({node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300}), reason(transport disconnected)[{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300} transport disconnected]


1> [2017-04-18T02:49:02,013][WARN ][o.e.c.NodeConnectionsService] [cluster_2m1] failed to connect to node {node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300} (tried [1] times)
  1> org.elasticsearch.transport.ConnectTransportException: [node_sm0/LfM1LgAAQACUzI9-AAAAAA][127.0.0.1:9300] general node connection failure
  1> 	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:471) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:311) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:298) ~[main/:?]
  1> 	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) ~[main/:?]
  1> 	at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:107) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
  1> 	at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_121]
  1> Caused by: java.lang.IllegalStateException: handshake failed with {node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300}
  1> 	at org.elasticsearch.transport.TransportService.handshake(TransportService.java:382) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.lambda$connectToNode$3(TransportService.java:313) ~[main/:?]
  1> 	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:454) ~[main/:?]
  1> 	... 9 more
  1> Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node_sm0/LfM1LgAAQACUzI9-AAAAAA][127.0.0.1:9300][internal:transport/handshake] request_id [25] timed out after [30000ms]
  1> 	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:921) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[main/:?]
  1> 	... 3 more
  1> [2017-04-18T02:49:02,016][WARN ][o.e.c.s.ClusterService   ] [cluster_2m1] cluster state update task [zen-disco-receive(from master [master {cluster_2m2}{CzEgKG_WQy6cAFNA9B31vw}{HNBvClr0T0SNeetgGZJHOA}{127.0.0.1}{127.0.0.1:9540} committed version [52]])] took [30s] above the warn threshold of 30s
  1> [2017-04-18T02:49:02,016][INFO ][o.e.c.s.ClusterService   ] [cluster_2m1] removed {{node_sm0/LfM1LgAAQACUzI9-AAAAAA}{GBlwfM8rSk2dE1mQ3nJJ5A}{ONCmCnYQReK35DcZ13C60g}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [master {cluster_2m2}{CzEgKG_WQy6cAFNA9B31vw}{HNBvClr0T0SNeetgGZJHOA}{127.0.0.1}{127.0.0.1:9540} committed version [53]])

I've pushed 7c39507 to address this.

@ywelsch ywelsch closed this as completed Apr 24, 2017