[CI] Failure in org.elasticsearch.cluster.MinimumMasterNodesIT #57585

Closed

original-brownbear opened this issue Jun 3, 2020 · 2 comments · Fixed by #57608
Comments

@original-brownbear
Member

This failed on CI and once locally for me: https://gradle-enterprise.elastic.co/s/3lz2nfmnfm5xm/

We're tripping an assertion in this test:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.cluster.MinimumMasterNodesIT" -Dtests.seed=2B9777EED69586A8 -Dtests.security.manager=true -Dtests.locale=en-US -Dtests.timezone=Etc/UTC -Druntime.java=11
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=310, name=elasticsearch[node_t2][scheduler][T#1], state=RUNNABLE, group=TGRP-MinimumMasterNodesIT]
Caused by: java.lang.AssertionError: Expected current thread [Thread[elasticsearch[node_t2][scheduler][T#1],5,TGRP-MinimumMasterNodesIT]] to not be the scheduler thread. Reason: [Blocking operation]
	at __randomizedtesting.SeedInfo.seed([2B9777EED69586A8]:0)
	at org.elasticsearch.threadpool.ThreadPool.assertNotScheduleThread(ThreadPool.java:731)
	at org.elasticsearch.common.util.concurrent.BaseFuture.blockingAllowed(BaseFuture.java:93)
	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:86)
	at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$acquireStore$22(RecoverySourceHandler.java:408)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:106)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:88)
	at org.elasticsearch.common.lease.Releasables.close(Releasables.java:36)
	at org.elasticsearch.common.lease.Releasables.close(Releasables.java:46)
	at org.elasticsearch.common.lease.Releasables.close(Releasables.java:51)
	at org.elasticsearch.common.lease.Releasables.lambda$releaseOnce$2(Releasables.java:105)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:106)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:64)
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$recoverToTarget$8(RecoverySourceHandler.java:248)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
	at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:178)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:98)
	at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:162)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:135)
	at org.elasticsearch.action.StepListener.innerOnFailure(StepListener.java:67)
	at org.elasticsearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:47)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
	at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:178)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:98)
	at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:162)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:135)
	at org.elasticsearch.action.StepListener.innerOnFailure(StepListener.java:67)
	at org.elasticsearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:47)
	at org.elasticsearch.indices.recovery.MultiFileTransfer.onCompleted(MultiFileTransfer.java:146)
	at org.elasticsearch.indices.recovery.MultiFileTransfer.handleItems(MultiFileTransfer.java:134)
	at org.elasticsearch.indices.recovery.MultiFileTransfer$1.write(MultiFileTransfer.java:79)
	at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:108)
	at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:96)
	at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:84)
	at org.elasticsearch.indices.recovery.MultiFileTransfer.addItem(MultiFileTransfer.java:90)
	at org.elasticsearch.indices.recovery.MultiFileTransfer.lambda$handleItems$4(MultiFileTransfer.java:126)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
	at org.elasticsearch.action.ActionListener$6.onFailure(ActionListener.java:292)
	at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:173)
	at org.elasticsearch.action.ActionListener$6.onFailure(ActionListener.java:292)
	at org.elasticsearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:149)
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59)
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:545)
	at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler$1.tryAction(RemoteRecoveryTargetHandler.java:258)
	at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:98)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:691)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

The issue seems to be that we schedule some of the retrying in org.elasticsearch.action.support.RetryableAction on the SAME pool, so it runs on the scheduler thread. In some corner cases the retry then does a blocking get on a future and trips the above assertion.
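To make the failure mode easier to see, here is a simplified, self-contained sketch of the pattern using plain java.util.concurrent rather than the actual ThreadPool/RetryableAction classes (all names below are illustrative): the first scheduled task blocks on a future directly on the scheduler thread, which is exactly what the assertion guards against, while the second only uses the scheduler for the delay and hands the potentially blocking work to a generic pool.

```java
import java.util.concurrent.*;

// Simplified sketch of the problem described above, using plain java.util.concurrent
// instead of the Elasticsearch ThreadPool/RetryableAction classes.
public class SchedulerBlockingSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the scheduler pool: a single thread that must never block.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Stand-in for a general-purpose pool where blocking is acceptable.
        ExecutorService generic = Executors.newCachedThreadPool();

        // Stand-in for "the recovery's store ref has been released".
        CompletableFuture<String> storeReleased = new CompletableFuture<>();
        generic.submit(() -> {
            Thread.sleep(500);                       // released some time later
            storeReleased.complete("released");
            return null;
        });

        // "SAME"-style retry: the retry body runs directly on the scheduler thread and blocks.
        // An assertion like ThreadPool#assertNotScheduleThread would trip right here.
        scheduler.schedule(() -> {
            try {
                System.out.println("unsafe, blocked the scheduler: " + storeReleased.get(1, TimeUnit.SECONDS));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }, 100, TimeUnit.MILLISECONDS);

        // Fix direction: use the scheduler only to schedule, and run the potentially
        // blocking body on the generic pool instead.
        scheduler.schedule(() -> generic.execute(() -> {
            try {
                System.out.println("safe, blocked a generic thread: " + storeReleased.get(1, TimeUnit.SECONDS));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }), 100, TimeUnit.MILLISECONDS);

        Thread.sleep(2_000);
        scheduler.shutdownNow();
        generic.shutdownNow();
    }
}
```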

@original-brownbear added the >test-failure and :Distributed/Distributed labels on Jun 3, 2020
@original-brownbear self-assigned this on Jun 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@elasticmachine added the Team:Distributed label on Jun 3, 2020
@original-brownbear
Member Author

This is a bit of a variation of #46178 ... now that we have retries in the recovery, it seems we can get stuck in a situation where we submit a task that cleans up an index commit ref to the generic pool, but then shut down that pool without the task ever executing.
The fact that we see the above error on the scheduler thread is just a symptom of failing to wait (though we might still want to clean this up, because we shouldn't have a blocking action run on the scheduler thread). Looking for a fix now.
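For illustration, a minimal self-contained sketch of that shutdown race (plain java.util.concurrent, hypothetical names, not the actual recovery code): the cleanup task is queued on a pool that is shut down before it ever runs, so the commit ref is never released and anything waiting for it just times out.

```java
import java.util.concurrent.*;

// Minimal sketch of the race described above: a cleanup task is handed to a pool that is
// shut down before the task ever runs, so the "commit ref" is never released.
// Names and structure are illustrative, not the actual Elasticsearch recovery code.
public class ShutdownRaceSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService generic = Executors.newFixedThreadPool(1);
        CompletableFuture<Void> commitRefReleased = new CompletableFuture<>();

        // Keep the single worker busy so the cleanup task sits in the queue.
        generic.execute(() -> {
            try { Thread.sleep(1_000); } catch (InterruptedException ignored) {}
        });

        // The cleanup is queued behind the busy worker...
        generic.execute(() -> commitRefReleased.complete(null));

        // ...but the pool is shut down first: queued tasks are dropped, cleanup never runs.
        generic.shutdownNow();

        try {
            // Whoever waits for the release blocks until it gives up.
            commitRefReleased.get(2, TimeUnit.SECONDS);
            System.out.println("commit ref released");
        } catch (TimeoutException e) {
            System.out.println("cleanup never ran; commit ref leaked");
        }
    }
}
```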

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 3, 2020
If a node is disconnected we retry. It does not make sense
to retry the recovery if the node has been removed from the cluster, though.
=> added a check that the node is still part of the cluster before retrying

Also, we were running the retry on the `SAME` pool, which for each retry will
be the scheduler pool. Since the error path of the listener we use here
does blocking operations when closing the resources used by the recovery,
we can't use the `SAME` pool here: not all exceptions go through the `ActionListenerResponseHandler`
threading (e.g. `NodeNotConnectedException`).

Closes elastic#57585
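As a rough illustration of the two ideas in that commit message, here is a hedged sketch (hypothetical names and structure, not the actual RetryableAction change): give up when the target node has left the cluster, and use the scheduler only for the delay while running the potentially blocking retry body on a general-purpose pool.

```java
import java.util.Set;
import java.util.concurrent.*;

// Hedged sketch of the two ideas above; names and structure are hypothetical and
// simplified, not the actual change to RetryableAction / the recovery code.
class RecoveryRetrySketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService generic = Executors.newCachedThreadPool();
    // Stand-in for the node ids in the current cluster state.
    private volatile Set<String> clusterNodes = Set.of("node_t0", "node_t1", "node_t2");

    void onRetryableFailure(String targetNodeId, Runnable retryBody, long delayMillis) {
        // 1. Give up instead of retrying if the target node has left the cluster.
        if (clusterNodes.contains(targetNodeId) == false) {
            System.out.println("node " + targetNodeId + " left the cluster, cancelling recovery");
            return;
        }
        // 2. Use the scheduler only for the delay; the retry body (whose error path may block
        //    while releasing recovery resources) runs on a general-purpose pool, not on SAME.
        scheduler.schedule(() -> generic.execute(retryBody), delayMillis, TimeUnit.MILLISECONDS);
    }
}
```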
original-brownbear added a commit that referenced this issue Jun 10, 2020
If a node is disconnected we retry. It does not make sense
to retry the recovery if the node has been removed from the cluster, though.
=> added a CS listener that cancels the recovery for removed nodes

Also, we were running the retry on the `SAME` pool, which for each retry will
be the scheduler pool. Since the error path of the listener we use here
does blocking operations when closing the resources used by the recovery,
we can't use the `SAME` pool here: not all exceptions go through the `ActionListenerResponseHandler`
threading (e.g. `NodeNotConnectedException`).

Closes #57585