[CI] Failure in org.elasticsearch.cluster.MinimumMasterNodesIT #57585

Closed

original-brownbear opened this issue Jun 3, 2020 · 2 comments · Fixed by #57608
Comments

@original-brownbear
Member

This failed on CI and once locally for me: https://gradle-enterprise.elastic.co/s/3lz2nfmnfm5xm/

We're tripping an assertion in this test:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.cluster.MinimumMasterNodesIT" -Dtests.seed=2B9777EED69586A8 -Dtests.security.manager=true -Dtests.locale=en-US -Dtests.timezone=Etc/UTC -Druntime.java=11
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=310, name=elasticsearch[node_t2][scheduler][T#1], state=RUNNABLE, group=TGRP-MinimumMasterNodesIT]
Caused by: java.lang.AssertionError: Expected current thread [Thread[elasticsearch[node_t2][scheduler][T#1],5,TGRP-MinimumMasterNodesIT]] to not be the scheduler thread. Reason: [Blocking operation]
	at __randomizedtesting.SeedInfo.seed([2B9777EED69586A8]:0)
	at org.elasticsearch.threadpool.ThreadPool.assertNotScheduleThread(ThreadPool.java:731)
	at org.elasticsearch.common.util.concurrent.BaseFuture.blockingAllowed(BaseFuture.java:93)
	at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:86)
	at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:56)
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$acquireStore$22(RecoverySourceHandler.java:408)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:106)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:88)
	at org.elasticsearch.common.lease.Releasables.close(Releasables.java:36)
	at org.elasticsearch.common.lease.Releasables.close(Releasables.java:46)
	at org.elasticsearch.common.lease.Releasables.close(Releasables.java:51)
	at org.elasticsearch.common.lease.Releasables.lambda$releaseOnce$2(Releasables.java:105)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:106)
	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:64)
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$recoverToTarget$8(RecoverySourceHandler.java:248)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
	at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:178)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:98)
	at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:162)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:135)
	at org.elasticsearch.action.StepListener.innerOnFailure(StepListener.java:67)
	at org.elasticsearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:47)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
	at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:88)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:39)
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:178)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:98)
	at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:162)
	at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:135)
	at org.elasticsearch.action.StepListener.innerOnFailure(StepListener.java:67)
	at org.elasticsearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:47)
	at org.elasticsearch.indices.recovery.MultiFileTransfer.onCompleted(MultiFileTransfer.java:146)
	at org.elasticsearch.indices.recovery.MultiFileTransfer.handleItems(MultiFileTransfer.java:134)
	at org.elasticsearch.indices.recovery.MultiFileTransfer$1.write(MultiFileTransfer.java:79)
	at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:108)
	at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:96)
	at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:84)
	at org.elasticsearch.indices.recovery.MultiFileTransfer.addItem(MultiFileTransfer.java:90)
	at org.elasticsearch.indices.recovery.MultiFileTransfer.lambda$handleItems$4(MultiFileTransfer.java:126)
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71)
	at org.elasticsearch.action.ActionListener$6.onFailure(ActionListener.java:292)
	at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:173)
	at org.elasticsearch.action.ActionListener$6.onFailure(ActionListener.java:292)
	at org.elasticsearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:149)
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59)
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:545)
	at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler$1.tryAction(RemoteRecoveryTargetHandler.java:258)
	at org.elasticsearch.action.support.RetryableAction$1.doRun(RetryableAction.java:98)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:691)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

The issue seems to be that we schedule some of the retrying in org.elasticsearch.action.support.RetryableAction on the SAME pool, so it runs on the scheduler thread. In some corner cases the retry then does a blocking get on a future and trips the above assertion.
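To make the failure mode easier to see, here is a simplified, self-contained sketch of the pattern using plain java.util.concurrent rather than the actual ThreadPool/RetryableAction classes (all names below are illustrative): the first scheduled task blocks on a future directly on the scheduler thread, which is exactly what the assertion guards against, while the second only uses the scheduler for the delay and hands the potentially blocking work to a generic pool.

```java
import java.util.concurrent.*;

// Simplified sketch of the problem described above, using plain java.util.concurrent
// instead of the Elasticsearch ThreadPool/RetryableAction classes.
public class SchedulerBlockingSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the scheduler pool: a single thread that must never block.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Stand-in for a general-purpose pool where blocking is acceptable.
        ExecutorService generic = Executors.newCachedThreadPool();

        // Stand-in for "the recovery's store ref has been released".
        CompletableFuture<String> storeReleased = new CompletableFuture<>();
        generic.submit(() -> {
            Thread.sleep(500);                       // released some time later
            storeReleased.complete("released");
            return null;
        });

        // "SAME"-style retry: the retry body runs directly on the scheduler thread and blocks.
        // An assertion like ThreadPool#assertNotScheduleThread would trip right here.
        scheduler.schedule(() -> {
            try {
                System.out.println("unsafe, blocked the scheduler: " + storeReleased.get(1, TimeUnit.SECONDS));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }, 100, TimeUnit.MILLISECONDS);

        // Fix direction: use the scheduler only to schedule, and run the potentially
        // blocking body on the generic pool instead.
        scheduler.schedule(() -> generic.execute(() -> {
            try {
                System.out.println("safe, blocked a generic thread: " + storeReleased.get(1, TimeUnit.SECONDS));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }), 100, TimeUnit.MILLISECONDS);

        Thread.sleep(2_000);
        scheduler.shutdownNow();
        generic.shutdownNow();
    }
}
```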

@original-brownbear added the >test-failure and :Distributed/Distributed labels on Jun 3, 2020
@original-brownbear self-assigned this on Jun 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@elasticmachine added the Team:Distributed label on Jun 3, 2020
@original-brownbear
Member Author

This is a bit of a variation of #46178 ... now that we have retries in the recovery, it seems we can get stuck in a situation where we submit a task that cleans up an index commit ref to the generic pool, but then shut down that pool without the task ever executing.
The fact that we see the above error on the scheduler thread is just a symptom of failing to wait (though we might still want to clean this up, because we shouldn't have a blocking action run on the scheduler thread). Looking for a fix now.
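For illustration, a minimal self-contained sketch of that shutdown race (plain java.util.concurrent, hypothetical names, not the actual recovery code): the cleanup task is queued on a pool that is shut down before it ever runs, so the commit ref is never released and anything waiting for it just times out.

```java
import java.util.concurrent.*;

// Minimal sketch of the race described above: a cleanup task is handed to a pool that is
// shut down before the task ever runs, so the "commit ref" is never released.
// Names and structure are illustrative, not the actual Elasticsearch recovery code.
public class ShutdownRaceSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService generic = Executors.newFixedThreadPool(1);
        CompletableFuture<Void> commitRefReleased = new CompletableFuture<>();

        // Keep the single worker busy so the cleanup task sits in the queue.
        generic.execute(() -> {
            try { Thread.sleep(1_000); } catch (InterruptedException ignored) {}
        });

        // The cleanup is queued behind the busy worker...
        generic.execute(() -> commitRefReleased.complete(null));

        // ...but the pool is shut down first: queued tasks are dropped, cleanup never runs.
        generic.shutdownNow();

        try {
            // Whoever waits for the release blocks until it gives up.
            commitRefReleased.get(2, TimeUnit.SECONDS);
            System.out.println("commit ref released");
        } catch (TimeoutException e) {
            System.out.println("cleanup never ran; commit ref leaked");
        }
    }
}
```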

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 3, 2020
If a node is disconnected we retry. It does not make sense
to retry the recovery if the node has been removed from the cluster, though.
=> added a check that the node is still part of the cluster before retrying

Also, we were running the retry on the `SAME` pool, which for each retry will
be the scheduler pool. Since the error path of the listener we use here
does blocking operations when closing the resources used by the recovery,
we can't use the `SAME` pool here: not all exceptions go through the `ActionListenerResponseHandler`
threading (e.g. `NodeNotConnectedException`).

Closes elastic#57585
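As a rough illustration of the two ideas in that commit message, here is a hedged sketch (hypothetical names and structure, not the actual RetryableAction change): give up when the target node has left the cluster, and use the scheduler only for the delay while running the potentially blocking retry body on a general-purpose pool.

```java
import java.util.Set;
import java.util.concurrent.*;

// Hedged sketch of the two ideas above; names and structure are hypothetical and
// simplified, not the actual change to RetryableAction / the recovery code.
class RecoveryRetrySketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService generic = Executors.newCachedThreadPool();
    // Stand-in for the node ids in the current cluster state.
    private volatile Set<String> clusterNodes = Set.of("node_t0", "node_t1", "node_t2");

    void onRetryableFailure(String targetNodeId, Runnable retryBody, long delayMillis) {
        // 1. Give up instead of retrying if the target node has left the cluster.
        if (clusterNodes.contains(targetNodeId) == false) {
            System.out.println("node " + targetNodeId + " left the cluster, cancelling recovery");
            return;
        }
        // 2. Use the scheduler only for the delay; the retry body (whose error path may block
        //    while releasing recovery resources) runs on a general-purpose pool, not on SAME.
        scheduler.schedule(() -> generic.execute(retryBody), delayMillis, TimeUnit.MILLISECONDS);
    }
}
```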
original-brownbear added a commit that referenced this issue Jun 10, 2020
If a node is disconnected we retry. It does not make sense
to retry the recovery if the node has been removed from the cluster, though.
=> added a CS listener that cancels the recovery for removed nodes

Also, we were running the retry on the `SAME` pool, which for each retry will
be the scheduler pool. Since the error path of the listener we use here
does blocking operations when closing the resources used by the recovery,
we can't use the `SAME` pool here: not all exceptions go through the `ActionListenerResponseHandler`
threading (e.g. `NodeNotConnectedException`).

Closes #57585