[CI] TokenBackwardsCompatibilityIT failure in 6.x #37379

Open
gwbrown opened this Issue Jan 11, 2019 · 10 comments

gwbrown (Contributor) commented Jan 11, 2019

This failure appears to be pretty rare, but happened twice in 6.x intake this morning:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/960/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/966/console (maybe? This one might be caused by something else) [edit: Don't think this one is connected, see below]

Reproduce line (I'm having trouble with my local environment at the moment, so I can't confirm whether this reproduces, but I don't think it does):

./gradlew :x-pack:qa:rolling-upgrade:without-system-key:v5.6.15#twoThirdsUpgradedTestRunner \
  -Dtests.seed=9E7FA2C9D621808E \
  -Dtests.class=org.elasticsearch.upgrades.TokenBackwardsCompatibilityIT \
  -Dtests.method="testMixedCluster" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-BH \
  -Dtests.timezone=CET \
  -Dcompiler.java=11 \
  -Druntime.java=8

Stack trace:

org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:45085], URI [_xpack/security/oauth2/token], status line [HTTP/1.1 400 Bad Request]
{"error":"invalid_grant","error_description":"token has already been refreshed"}
	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:929)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229)
	at org.elasticsearch.upgrades.TokenBackwardsCompatibilityIT.testMixedCluster(TokenBackwardsCompatibilityIT.java:114)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:45085], URI [_xpack/security/oauth2/token], status line [HTTP/1.1 400 Bad Request]
{"error":"invalid_grant","error_description":"token has already been refreshed"}
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:548)
	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:533)
	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
	... 1 more
	Suppressed: org.apache.http.ConnectionClosedException: Connection closed
		at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:344)
		at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
		... 10 more

This also seems to leave the cluster in a bad state, which causes other tests to fail.

Possibly related to #33197 and/or #31195
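
For context, the failing request is a token refresh against the OAuth2 token endpoint shown in the stack trace. A minimal sketch of such a call with the 6.x low-level REST client (host, port, and token value are illustrative placeholders, and authentication headers are omitted):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RefreshTokenSketch {
    public static void main(String[] args) throws Exception {
        String refreshToken = "REPLACE_WITH_REFRESH_TOKEN"; // illustrative placeholder
        try (RestClient client = RestClient.builder(new HttpHost("127.0.0.1", 9200)).build()) {
            Request request = new Request("POST", "/_xpack/security/oauth2/token");
            request.setJsonEntity(
                "{\"grant_type\":\"refresh_token\",\"refresh_token\":\"" + refreshToken + "\"}");
            // Refreshing the same token a second time fails with 400 invalid_grant /
            // "token has already been refreshed", which the client surfaces as the
            // ResponseException in the stack trace above.
            Response response = client.performRequest(request);
            System.out.println(response.getStatusLine());
        }
    }
}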


gwbrown (Contributor) commented Jan 11, 2019

After looking more at these, especially https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/966/console, I'm inclined to say that there might be another problem here. I'm going to leave this open, as this test failure may be valid, but I'm not sure it's the only problem: it looks like at least one of the nodes simply died in the middle of some of these tests.


jkakavas (Contributor) commented Jan 14, 2019

These two reproduce locally, but the failure doesn't seem to be related to the token refresh; rather, no master can be elected in the mixed cluster:

|    [2019-01-14T09:38:53,098][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-14T09:38:53,612][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-14T09:38:58,228][INFO ][o.e.d.z.ZenDiscovery     ] [upgraded-node-1] failed to send join request to master [{node-2}{aIvm3YvcRsGaZ6uRbbI7Eg}{1glrAnPPTEKyoYgGfe1PFw}{127.0.0.1}{127.0.0.1:33229}{testattr=test, ml.max_open_jobs=10, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
|    [2019-01-14T09:39:23,612][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-14T09:39:23,614][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-14T09:39:24,123][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-14T09:39:54,124][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-14T09:39:54,124][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]

The token refresh error does appear in the relevant logs (e.g. from build 335), but the suppressed exception doesn't make sense to me:

> Caused by: org.elasticsearch.client.ResponseException: method [POST], host [http://127.0.0.1:38358], URI [_xpack/security/oauth2/token], status line [HTTP/1.1 400 Bad Request]
   > {"error":"invalid_grant","error_description":"token has already been refreshed"}
   > 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:548)
   > 	at org.elasticsearch.client.RestClient$1.completed(RestClient.java:533)
   > 	at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
   > 	at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
   > 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
   > 	at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
   > 	at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
   > 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
   > 	at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
   > 	at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
   > 	at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
   > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
   > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
   > 	at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
   > 	at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
   > 	at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
   > 	... 1 more
   > 	Suppressed: org.apache.http.ConnectionClosedException: Connection closed
   > 		at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.endOfInput(HttpAsyncRequestExecutor.java:344)
   > 		at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:261)
   > 		... 10 more
   > Throwable #2: java.io.IOException: listener timeout after waiting for [30000] ms
   > 	at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:908)
   > 	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:229)
   > 	at org.elasticsearch.test.rest.ESRestTestCase.wipeSnapshots(ESRestTestCase.java:497)
   > 	at org.elasticsearch.test.rest.ESRestTestCase.wipeCluster(ESRestTestCase.java:469)
   > 	at org.elasticsearch.test.rest.ESRestTestCase.cleanUpCluster(ESRestTestCase.java:263)
   > 	at java.lang.Thread.run(Thread.java:748)

Locally, these tests fail before the test even tries to refresh the token, and I see similar MasterNotDiscoveredException messages in the CI logs.

I'll keep looking, but it could be that the refresh token error is a red herring.
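
For reference, the health check that the upgraded node keeps timing out on looks like this (endpoint and parameters are taken from the logs above; host and port are illustrative):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ClusterHealthWaitSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("127.0.0.1", 9200)).build()) {
            Request health = new Request("GET", "/_cluster/health");
            health.addParameter("wait_for_status", "yellow");
            health.addParameter("wait_for_nodes", "3");
            // With no electable master this request cannot complete; the server logs it
            // as MasterNotDiscoveredException under "r.suppressed", and the harness's
            // wait condition eventually fails.
            Response response = client.performRequest(health);
            System.out.println(response.getStatusLine());
        }
    }
}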

matriv (Contributor) commented Jan 15, 2019

Happened again today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1002/console
and reproduced locally on 6.x:

|    [2019-01-15T15:28:19,561][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:28:22,279][WARN ][o.e.d.z.ZenDiscovery     ] [upgraded-node-1] not enough master nodes discovered during pinging (found [[Candidate{node={upgraded-node-1}{gX7MA9FtT-aKFk9CRr-lcQ}{h2_Hsc68QEKyeI2q_Kv7pA}{127.0.0.1}{127.0.0.1:57280}{upgraded=true, ml.machine_memory=34359738368, xpack.installed=true, testattr=test, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [2]), pinging again
|    [2019-01-15T15:28:49,570][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:28:49,572][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-15T15:28:49,894][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:29:19,896][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:29:19,898][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-15T15:29:20,400][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:29:25,379][INFO ][o.e.d.z.ZenDiscovery     ] [upgraded-node-1] failed to send join request to master [{node-2}{WQQi8QhsStKrOLR3HlfSPw}{bx0glClkRDOTHGZGKqQ8fg}{127.0.0.1}{127.0.0.1:56798}{testattr=test, ml.max_open_jobs=10, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
|    [2019-01-15T15:29:50,405][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:29:50,406][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|    [2019-01-15T15:29:50,912][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] no known master node, scheduling a retry
|    [2019-01-15T15:30:20,916][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [upgraded-node-1] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])
|    [2019-01-15T15:30:20,918][WARN ][r.suppressed             ] [upgraded-node-1] path: /_cluster/health, params: {wait_for_status=yellow, wait_for_nodes=3}
|    org.elasticsearch.discovery.MasterNotDiscoveredException: null
|       at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
|       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
|       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
|       at java.lang.Thread.run(Thread.java:834) [?:?]
|-----------------------------------------

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':x-pack:qa:rolling-upgrade:without-system-key:v5.6.15#twoThirdsUpgradedTestCluster#wait'.
> Elasticsearch cluster failed to pass wait condition

jkakavas (Contributor) commented Jan 15, 2019

👀

matriv added a commit that referenced this issue Jan 15, 2019

matriv added a commit that referenced this issue Jan 15, 2019

matriv (Contributor) commented Jan 15, 2019

The tests are muted in master and 6.x as they fail quite often.
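
For reference, muting in the Elasticsearch test suite is typically done with an @AwaitsFix annotation pointing at the issue. A sketch of the usual pattern; the actual mute commits may differ:

import org.apache.lucene.util.LuceneTestCase.AwaitsFix;

public class TokenBackwardsCompatibilityIT /* extends the usual rolling-upgrade base class */ {
    // Muted until this issue is resolved; the runner skips the test and reports the bug URL.
    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/37379")
    public void testMixedCluster() throws Exception {
        // test body unchanged
    }
}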

matriv (Contributor) commented Jan 15, 2019

I reverted the mute on master, as the test seems to fail only in 6.x:
6129e9d

jkakavas (Contributor) commented Jan 16, 2019

> Happened again today: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1002/console
> and reproduced locally on 6.x:

The local reproduction doesn't have to do with the original error (failed to refresh token) that we encounter on CI (see the explanation here).

I think the way to try and reproduce these locally is:

$ cd x-pack/qa/rolling-upgrade/without-system-key
$ ../../../../gradlew check -Dtests.seed=2AFD25BC2B1F75AB -Dtests.class=org.elasticsearch.upgrades.TokenBackwardsCompatibilityIT  -Dtests.security.manager=true -Dtests.locale=fr-FR -Dtests.timezone=IST -Dcompiler.java=11 -Druntime.java=8

which doesn't reproduce for me.

Looking into the rest of the failures, I came to realize that this might fail more consistently than we think. The mixedClusterTest runs only in the twoThirdsUpgraded state, and only if the master is on the new version. That means that if the original master is the last node to be upgraded, the mixedClusterTest will not run. In all the successful CI builds I've seen, the test is actually skipped:

IGNOR/A 0.08s | TokenBackwardsCompatibilityIT.testMixedCluster
   > Assumption #1: the master must be on the latest version before we can write
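
That skip presumably comes from an assumption guard along these lines (a hypothetical sketch using org.junit.Assume.assumeTrue; the real check in TokenBackwardsCompatibilityIT may be structured differently, and isMasterOnLatestVersion() is an illustrative placeholder):

public void testMixedCluster() throws Exception {
    // Skips the test with the "Assumption #1" message seen above when the
    // current master is still on the old version.
    assumeTrue("the master must be on the latest version before we can write",
            isMasterOnLatestVersion());
    // ... rest of the test
}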

Now this makes the failure even stranger:
a) It never reproduces locally.
b) The test itself creates the token and refresh token just before it tries to use the refresh token, so I can't really figure out how it can already have been refreshed (see the sketch below).
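
The sequence inside the test is roughly the following (an illustrative fragment: client stands for the test's low-level REST client, entityAsMap is the ESRestTestCase helper, and the credentials are placeholders):

// 1) Create a brand-new access/refresh token pair.
Request createToken = new Request("POST", "/_xpack/security/oauth2/token");
createToken.setJsonEntity(
    "{\"grant_type\":\"password\",\"username\":\"test_user\",\"password\":\"test-password\"}");
Map<String, Object> created = entityAsMap(client.performRequest(createToken));
String refreshToken = (String) created.get("refresh_token");

// 2) Immediately perform the first and only refresh of that token; yet CI
//    intermittently answers 400 "token has already been refreshed".
Request refresh = new Request("POST", "/_xpack/security/oauth2/token");
refresh.setJsonEntity(
    "{\"grant_type\":\"refresh_token\",\"refresh_token\":\"" + refreshToken + "\"}");
Response refreshed = client.performRequest(refresh);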

I'll keep looking.

jkakavas (Contributor) commented Jan 16, 2019

I unmuted the test in 05b2c80, hoping to get some additional debugging information, as this never fails locally. I will keep a close eye on 6.x and mute the test again if it keeps failing, so that I don't add too much noise.
