Abort downsample persistent task if source index disappeared during task re-assignment. #98769
Conversation
…ask re-assignment.

The result of this issue is that a persistent task gets stuck forever. The following errors can occur in the es log:

```
[2023-08-22T17:35:08,733][WARN ][o.e.c.s.ClusterApplierService] [node_t0] failed to notify ClusterStateListener
1> org.elasticsearch.index.IndexNotFoundException: no such index [avrcruawcg]
1> at org.elasticsearch.cluster.routing.RoutingTable.shardRoutingTable(RoutingTable.java:142) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.xpack.downsample.DownsampleShardPersistentTaskExecutor.getAssignment(DownsampleShardPersistentTaskExecutor.java:113) ~[main/:?]
1> at org.elasticsearch.xpack.downsample.DownsampleShardPersistentTaskExecutor.getAssignment(DownsampleShardPersistentTaskExecutor.java:51) ~[main/:?]
1> at org.elasticsearch.persistent.PersistentTasksClusterService.createAssignment(PersistentTasksClusterService.java:351) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.persistent.PersistentTasksClusterService.shouldReassignPersistentTasks(PersistentTasksClusterService.java:438) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.persistent.PersistentTasksClusterService.clusterChanged(PersistentTasksClusterService.java:366) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:560) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:546) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:505) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
```

This change checks for this situation and allows these downsample tasks to fail and clean up. This is done by assigning the task to a data node, so that when the operation is executed on that node it fails (because the index cannot be found) and is cleaned up.

Relates to elastic#98764
Pinging @elastic/es-analytics-geo (Team:Analytics)
```java
// If during re-assignment the source index disappeared, then we need to break out.
// Returning NO_NODE_FOUND just keeps the persistent task until the source index appears again (which would never happen).
// So let's return a node and then in the node operation we would just fail and stop this persistent task.
var indexShardRouting = clusterState.routingTable().shardRoutingTable(params.shardId());
```
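The assignment decision under discussion can be modeled in isolation. The following is a minimal, self-contained sketch, not Elasticsearch's actual API: the class, the `Map`-based routing table, and the method signature are all hypothetical stand-ins. It shows the intended behavior that if the source index is gone, the executor returns some data node instead of no node at all, so the task runs once, fails on that node, and gets cleaned up.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the cluster state consulted during assignment.
class AssignmentSketch {
    // routingTable: indexName -> node holding that index's primary shard.
    static String getAssignment(Map<String, String> routingTable, List<String> dataNodes, String sourceIndex) {
        if (routingTable.containsKey(sourceIndex) == false) {
            // Source index disappeared during re-assignment: returning no node
            // would park the task forever, so pick any data node and let the
            // node-level operation fail fast and clean the task up.
            return dataNodes.isEmpty() ? null : dataNodes.get(0);
        }
        // Normal case: assign to the node holding the primary shard.
        return routingTable.get(sourceIndex);
    }
}
```

The key point is that the index-existence check happens before any shard-routing lookup, because the lookup itself is what throws when the index is missing.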
I don't understand why we do this "before" the next if block. Wouldn't it be better to do it inside the if block if we do not find any node? My understanding is that if we try to find a node and happen not to find one, it could be because the index is not there anymore.
Then maybe we can just add a check to the downsample task which verifies the existence of the source index at the very beginning and fails with an "unrecoverable" error. In that case the executor would not retry, right?
This way we can also terminate all the tasks running on each shard...?
Or maybe we don't even need this logic at all... In the downsample task, at the very beginning when fetching the initial state, we could check whether the index exists. If not, we can fail with an "unrecoverable" error. That should stop the executor from retrying indefinitely, right?
Wait, I see what the issue is here: if the index is deleted, there is no shard with index data, so there is no shard on which to start a task, and we cannot start a downsample task that would check the existence of the index. So we need to start the task on any available node (there is no node holding that shard, because the index is deleted) just so the task can run and realize the index is not there anymore.
> I don't understand why we do this "before" the next if block.

This is because with `final ShardRouting shardRouting = clusterState.routingTable().shardRoutingTable(shardId).primaryShard();`, the `shardRoutingTable(shardId)` part will fail. This is why I moved this up.
> Wait, I see what the issue is here... if the index is deleted, there is no shard with index data, so there is no shard to start a task on, and we cannot start a downsample task to check the existence of the index...

Yes, we're basically stuck forever. Getting an assignment fails with an exception and the persistent task framework tries again on the next cluster state update. But it will keep failing because the index has been removed.
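The stuck-forever behavior can be illustrated with a toy retry loop. This is a hypothetical simulation, not the persistent-task framework itself: when the executor never yields a node, the task survives every cluster-state round, whereas assigning any node lets it execute once, fail, and be removed.

```java
// Hypothetical simulation of the persistent task re-assignment retry loop.
class RetrySketch {
    // Simulate up to maxRounds cluster-state updates; returns how many
    // rounds the task survived before being cleaned up (or maxRounds).
    static int rounds(boolean assignAnyNodeWhenIndexMissing, int maxRounds) {
        boolean taskPresent = true;
        int round = 0;
        while (taskPresent && round < maxRounds) {
            round++;
            if (assignAnyNodeWhenIndexMissing) {
                // Task runs on some node, hits IndexNotFoundException there,
                // and the framework cleans it up.
                taskPresent = false;
            }
            // else: no node assigned -> task stays, framework retries on the
            // next cluster state update, forever.
        }
        return round;
    }
}
```

With the fix the task terminates after a single round instead of retrying on every cluster-state update.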
@elasticmachine update branch
💚 Backport successful
…ask re-assignment. (#98769) (#98894)
If, as part of persistent task assignment, the source downsample index no longer exists, the persistent task framework will continuously try to find an assignment and fail with `IndexNotFoundException` (which gets logged as a warning on the elected master node). This fixes a bug in resolving the shard routing, so that if the index no longer exists any node is returned and the persistent task can fail gracefully at a later stage. The original fix via elastic#98769 didn't get this part right.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
(Marking as a non-issue, since this is a bug in unreleased code.)