Abort downsample persistent task if source index disappeared during task re-assignment. #98769

Merged

Conversation

martijnvg
Member

The result of this issue is that a persistent task gets stuck forever. The following errors can occur in the Elasticsearch log:

```
[2023-08-22T17:35:08,733][WARN ][o.e.c.s.ClusterApplierService] [node_t0] failed to notify ClusterStateListener
  1> org.elasticsearch.index.IndexNotFoundException: no such index [avrcruawcg]
  1> 	at org.elasticsearch.cluster.routing.RoutingTable.shardRoutingTable(RoutingTable.java:142) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.xpack.downsample.DownsampleShardPersistentTaskExecutor.getAssignment(DownsampleShardPersistentTaskExecutor.java:113) ~[main/:?]
  1> 	at org.elasticsearch.xpack.downsample.DownsampleShardPersistentTaskExecutor.getAssignment(DownsampleShardPersistentTaskExecutor.java:51) ~[main/:?]
  1> 	at org.elasticsearch.persistent.PersistentTasksClusterService.createAssignment(PersistentTasksClusterService.java:351) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.persistent.PersistentTasksClusterService.shouldReassignPersistentTasks(PersistentTasksClusterService.java:438) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.persistent.PersistentTasksClusterService.clusterChanged(PersistentTasksClusterService.java:366) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:560) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:546) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:505) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:429) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:916) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:217) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:183) ~[elasticsearch-8.10.0-SNAPSHOT.jar:8.10.0-SNAPSHOT]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
```

This change checks for this situation and allows these downsample tasks to fail and clean up. This is done by assigning the task to a data node, so that when the operation is executed on that node it fails and is cleaned up (because the index can't be found).
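For illustration, here is a minimal, self-contained sketch of the assignment decision described above. It is not the actual `DownsampleShardPersistentTaskExecutor` code; the types and parameters are simplified stand-ins for the real cluster-state APIs.

```java
import java.util.List;
import java.util.Optional;

// Simplified model of the assignment decision; hypothetical types, not Elasticsearch code.
final class DownsampleAssignmentSketch {

    record Assignment(String executorNodeId, String explanation) {
        static final Assignment NO_NODE_FOUND = new Assignment(null, "no node found");
    }

    static Assignment getAssignment(boolean sourceIndexExists,
                                    Optional<String> primaryShardNodeId,
                                    List<String> dataNodeIds) {
        if (sourceIndexExists == false) {
            // Returning NO_NODE_FOUND would keep the persistent task waiting for an index
            // that will never come back. Instead, hand the task to any data node so the node
            // operation runs, fails (index not found) and the task gets cleaned up.
            return dataNodeIds.isEmpty()
                ? Assignment.NO_NODE_FOUND
                : new Assignment(dataNodeIds.get(0), "source index missing; fail and clean up on this node");
        }
        // Normal case: run on the node that holds the primary shard of the source index.
        return primaryShardNodeId
            .map(node -> new Assignment(node, "node holding the primary shard"))
            .orElse(Assignment.NO_NODE_FOUND);
    }
}
```

The key point is that a missing source index is treated as "assign anywhere so the task can terminate" rather than "wait for the index to come back".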

Relates to #98764

(Marking as a non-issue, since this is a bug in unreleased code.)

@martijnvg martijnvg added >non-issue :StorageEngine/Downsampling Downsampling (replacement for rollups) - Turn fine-grained time-based data into coarser-grained data v8.11.0 v8.10.1 labels Aug 23, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 23, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

```java
// If during re-assignment the source index disappeared, then we need to break out.
// Returning NO_NODE_FOUND just keeps the persistent task until the source index appears again (which would never happen)
// So let's return a node and then in the node operation we would just fail and stop this persistent task
var indexShardRouting = clusterState.routingTable().shardRoutingTable(params.shardId());
```
Contributor

I don't understand why we do this "before" the next if block.

Wouldn't it be better to do it inside the if block, if we do not find any node? My understanding is that if we try to find a node and we happen not to find any, it could be because the index is not there anymore.

Then maybe we can just add a check to the downsample task which checks for the existence of the source index at the very beginning and fails with an "unrecoverable" error. In that case the executor would not retry, right?

This way we can also terminate all the tasks running on each shard...?

Contributor

Or maybe we don't even need this logic at all...

In the downsample task, at the very beginning when fetching the initial state, we could check if the index exists. If not, we can fail with an "unrecoverable" error. That should stop the executor from indefinitely retrying, right?
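As a rough sketch of that idea, using plain Java stand-ins rather than the real persistent-task API (`indexExists` and `markAsFailed` below are hypothetical placeholders for a cluster-state lookup and the task's failure callback):

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical stand-ins for the real task API; only meant to illustrate the suggestion above.
final class FailFastNodeOperationSketch {

    static void runDownsampleShardTask(String sourceIndex,
                                       Predicate<String> indexExists,     // e.g. a cluster-state lookup
                                       Consumer<Exception> markAsFailed,  // terminal, non-retryable failure
                                       Runnable doDownsample) {
        // Check the source index before doing any work: if it is gone, fail with an
        // "unrecoverable" error so the executor stops instead of retrying forever.
        if (indexExists.test(sourceIndex) == false) {
            markAsFailed.accept(new IllegalStateException("source index [" + sourceIndex + "] no longer exists"));
            return;
        }
        doDownsample.run();
    }
}
```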

Contributor

Wait, I see what the issue is here... if the index is deleted there is no shard with index data, so there is no shard on which to start a task, and we are not able to start a downsample task that could check the existence of the index. So we just need to start the task on any available node (there is no node holding that shard, because the index is deleted), simply so the task runs and can realize the index is not there anymore.

Member Author

> I don't understand why we do this "before" the next if block.

This is because in `final ShardRouting shardRouting = clusterState.routingTable().shardRoutingTable(shardId).primaryShard();` the `shardRoutingTable(shardId)` part fails when the index no longer exists. That is why I moved this check up.
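A tiny standalone illustration of that ordering, where a plain map plays the role of the routing table whose lookup throws for a missing index (not Elasticsearch code):

```java
import java.util.Map;
import java.util.NoSuchElementException;

// Illustrative only: the lookup throws for a missing index, so the guard must run first.
final class GuardBeforeLookupSketch {

    // Stand-in for the routing-table lookup, which fails when the index is gone.
    static String primaryNodeOf(Map<String, String> primaryNodeByIndex, String index) {
        String node = primaryNodeByIndex.get(index);
        if (node == null) {
            throw new NoSuchElementException("no such index [" + index + "]");
        }
        return node;
    }

    static String resolveExecutorNode(Map<String, String> primaryNodeByIndex, String index, String anyDataNode) {
        if (primaryNodeByIndex.containsKey(index) == false) {
            // Index disappeared: pick any data node so the task can fail there and be cleaned up.
            return anyDataNode;
        }
        return primaryNodeOf(primaryNodeByIndex, index); // safe: the index is known to exist
    }
}
```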

Member Author

> Wait I see what is the issue here... if the index is deleted there is no shard with index data -> there is no shard to start a task and we are not able to start a downsample task to check the existence of the index...

Yes, we're basically stuck forever. Getting an assignment fails with an exception and the persistent task framework tries again on the next cluster state update, but it keeps failing because the index has been removed.

@martijnvg
Member Author

@elasticmachine update branch

@martijnvg martijnvg added auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport-and-merge Automatically create backport pull requests and merge when ready labels Aug 25, 2023
@elasticsearchmachine elasticsearchmachine merged commit cd62bc8 into elastic:main Aug 25, 2023
11 checks passed
@martijnvg martijnvg deleted the downsample/cleanup_stuck_tasks branch August 25, 2023 17:42
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Aug 25, 2023
…ask re-assignment. (elastic#98769)

@elasticsearchmachine
Collaborator

💚 Backport successful

Branch: 8.10

elasticsearchmachine pushed a commit that referenced this pull request Aug 25, 2023
…ask re-assignment. (#98769) (#98894)

@JVerwolf JVerwolf added v8.10.0 and removed v8.10.1 labels Aug 31, 2023
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 12, 2024
If, as part of the persistent task assignment, the source downsample index no longer exists, then the persistent task framework will continuously try to find an assignment and fail with an IndexNotFoundException (which gets logged as a warning on the elected master node).

This fixes a bug in resolving the shard routing, so that if the index no longer exists any node is returned and the persistent task can fail gracefully at a later stage.

The original fix via elastic#98769 didn't get this part right.
martijnvg added a commit that referenced this pull request Mar 13, 2024
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 13, 2024
elasticsearchmachine pushed a commit that referenced this pull request Apr 5, 2024

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>