[CI] Failure in org.elasticsearch.cluster.MinimumMasterNodesIT #57585
Comments
Pinging @elastic/es-distributed (:Distributed/Distributed)
This is a bit of a variation of #46178 ... now that we have retries in the recovery, it seems we get stuck in a situation where we submit a task that cleans up an index commit ref to the generic pool, but then shut down that pool without the task ever executing.
If a node is disconnected we retry. It does not make sense to retry the recovery if the node has been removed from the cluster, though, so this adds a check that the node is still part of the cluster before retrying. Also, we were running the retry on the `SAME` pool, which for each retry is the scheduler pool. The error path of the listener we use here performs blocking operations when closing the resources used by the recovery, so we can't use the `SAME` pool here: not all exceptions go through the `ActionListenerResponseHandler` threading, e.g. `NodeNotConnectedException`. Closes elastic#57585
If a node is disconnected we retry. It does not make sense to retry the recovery if the node has been removed from the cluster, though, so this adds a cluster state (CS) listener that cancels the recovery for removed nodes. Also, we were running the retry on the `SAME` pool, which for each retry is the scheduler pool. The error path of the listener we use here performs blocking operations when closing the resources used by the recovery, so we can't use the `SAME` pool here: not all exceptions go through the `ActionListenerResponseHandler` threading, e.g. `NodeNotConnectedException`. Closes #57585
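To make the idea of cancelling recoveries for removed nodes concrete, here is a minimal sketch using only the JDK. It is not the Elasticsearch implementation: the class `RecoveryCancellationSketch`, its methods, and the use of plain string node IDs are all hypothetical, and it assumes at most one ongoing recovery per target node for brevity.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Elasticsearch code: tracks one recovery per target node.
final class RecoveryCancellationSketch {
    // Target node id -> action that cancels the recovery towards that node.
    private final Map<String, Runnable> cancelByTargetNode = new ConcurrentHashMap<>();

    void registerRecovery(String targetNodeId, Runnable cancelAction) {
        cancelByTargetNode.put(targetNodeId, cancelAction);
    }

    // Stands in for a cluster state listener: cancel recoveries whose target left the cluster.
    void onClusterChanged(Set<String> previousNodeIds, Set<String> currentNodeIds) {
        Set<String> removed = new HashSet<>(previousNodeIds);
        removed.removeAll(currentNodeIds);
        for (String nodeId : removed) {
            Runnable cancel = cancelByTargetNode.remove(nodeId);
            if (cancel != null) {
                cancel.run();
            }
        }
    }

    // A disconnect is only worth retrying while the target is still a cluster member.
    boolean shouldRetry(String targetNodeId, Set<String> currentNodeIds) {
        return currentNodeIds.contains(targetNodeId);
    }
}
```

With this shape, a disconnect from a node that is still in the cluster leads to a retry, while a node removal cancels the recovery outright instead of leaving it to retry until it times out.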
Failed on CI and once locally for me https://gradle-enterprise.elastic.co/s/3lz2nfmnfm5xm/
We're tripping an assertion in this test:
The issue seems to be that we schedule some of the retries in `org.elasticsearch.action.support.RetryableAction` on `SAME`, so they run on the scheduler thread. In some corner cases this in turn uses a blocking `get` on a future and trips the above assertion.
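The threading problem can be illustrated with a small JDK-only sketch. This is not the `RetryableAction` code: the class name, the thread-name check, and the `forkToGeneric` flag are assumptions made for illustration. Running the retry body inline models `SAME` (it stays on the scheduler thread), while handing it to a separate pool models dispatching to a generic executor where blocking is acceptable.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Elasticsearch code.
final class SchedulerBlockingSketch {
    // Assumed marker for the scheduler thread; real code may identify it differently.
    static final String SCHEDULER_THREAD_NAME = "sketch-scheduler";

    // Mimics an assertion that forbids blocking calls on the scheduler thread.
    static boolean blockingAllowedHere() {
        return !Thread.currentThread().getName().equals(SCHEDULER_THREAD_NAME);
    }

    // Schedules one retry and reports whether its body ran on a thread that may block.
    // forkToGeneric == false models SAME: the body runs on the scheduler thread itself.
    static boolean retryRunsWhereBlockingIsAllowed(boolean forkToGeneric) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, SCHEDULER_THREAD_NAME));
        ExecutorService generic = Executors.newCachedThreadPool();
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        try {
            Runnable retryBody = () -> result.complete(blockingAllowedHere());
            scheduler.schedule(() -> {
                if (forkToGeneric) {
                    generic.execute(retryBody); // hand off so the scheduler thread stays free
                } else {
                    retryBody.run();            // inline: still on the scheduler thread
                }
            }, 10, TimeUnit.MILLISECONDS);
            // Waiting here is fine: this is the caller's thread, not the scheduler.
            return result.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            scheduler.shutdownNow();
            generic.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("SAME: " + retryRunsWhereBlockingIsAllowed(false));
        System.out.println("forked: " + retryRunsWhereBlockingIsAllowed(true));
    }
}
```

In the inline case the check reports that blocking is not allowed, which corresponds to the tripped assertion; after forking, the same body runs on a pool thread where a blocking `get` would be permitted.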