[Transform] Unattended are failing due to missing configuration #107266

Open
Tracked by #107251
prwhelan opened this issue Apr 9, 2024 · 5 comments · May be fixed by #107917
Labels: >bug, :ml/Transform, Team:ML, v8.15.0

Comments

@prwhelan
Member

prwhelan commented Apr 9, 2024

Description

Issue seen ~1-3 times per week

https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/transform/src/main/java/org/elasticsearch/xpack/transform/transforms/TransformIndexer.java#L385-L386

If the transform config index or the transform config is gone, something serious occurred
We are in an unknown state and should fail out

org.elasticsearch.ResourceNotFoundException: Transform with id [endpoint.metadata_united-default-8.13.0] could not be found
	at org.elasticsearch.transform@8.14.0/org.elasticsearch.xpack.transform.persistence.IndexBasedTransformConfigManager.lambda$getTransformConfiguration$9(IndexBasedTransformConfigManager.java:433)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:202)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:196)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:95)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.TransportSearchAction.lambda$doExecute$2(TransportSearchAction.java:308)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$RunAfterActionListener.onResponse(ActionListenerImplementations.java:269)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListener.respondAndRelease(ActionListener.java:289)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:706)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:454)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:448)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:153)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:54)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:454)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:448)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:235)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:108)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:87)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1570)
org.elasticsearch.xpack.transform.transforms.TransformIndexer$TransformConfigLostOnReloadException: Failed to reload transform configuration for transform [endpoint.metadata_united-default-8.13.0]
	at org.elasticsearch.transform@8.14.0/org.elasticsearch.xpack.transform.transforms.TransformIndexer.lambda$onStart$19(TransformIndexer.java:383)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)
	at org.elasticsearch.transform@8.14.0/org.elasticsearch.xpack.transform.persistence.IndexBasedTransformConfigManager.lambda$getTransformConfiguration$9(IndexBasedTransformConfigManager.java:432)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:202)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.tasks.TaskManager$1.onResponse(TaskManager.java:196)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onResponse(ActionListenerImplementations.java:307)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$MappedActionListener.onResponse(ActionListenerImplementations.java:95)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.TransportSearchAction.lambda$doExecute$2(TransportSearchAction.java:308)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$ResponseWrappingActionListener.onResponse(ActionListenerImplementations.java:245)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListenerImplementations$RunAfterActionListener.onResponse(ActionListenerImplementations.java:269)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.ActionListener.respondAndRelease(ActionListener.java:289)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:706)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.FetchLookupFieldsPhase.run(FetchLookupFieldsPhase.java:75)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:454)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:448)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.ExpandSearchPhase.onPhaseDone(ExpandSearchPhase.java:153)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:54)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:454)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:448)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:235)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:108)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:87)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.server@8.14.0/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1570)
Caused by: org.elasticsearch.ResourceNotFoundException: Transform with id [endpoint.metadata_united-default-8.13.0] could not be found
	at org.elasticsearch.transform@8.14.0/org.elasticsearch.xpack.transform.persistence.IndexBasedTransformConfigManager.lambda$getTransformConfiguration$9(IndexBasedTransformConfigManager.java:433)
	... 27 more
org.elasticsearch.transport.RemoteTransportException: [<>][<>][cluster:admin/persistent/update_status]
Caused by: org.elasticsearch.ResourceNotFoundException: the task with id endpoint.metadata_united-default-8.13.0 and allocation id <> doesn't exist
	at org.elasticsearch.persistent.PersistentTasksClusterService$4.execute(PersistentTasksClusterService.java:264)
	at org.elasticsearch.cluster.service.MasterService$UnbatchedExecutor.execute(MasterService.java:550)
	at org.elasticsearch.cluster.service.MasterService.innerExecuteTasks(MasterService.java:1039)
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:1004)
	at org.elasticsearch.cluster.service.MasterService.executeAndPublishBatch(MasterService.java:232)
	at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.lambda$run$2(MasterService.java:1645)
	at org.elasticsearch.action.ActionListener.run(ActionListener.java:356)
	at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.run(MasterService.java:1642)
	at org.elasticsearch.cluster.service.MasterService$5.lambda$doRun$0(MasterService.java:1237)
	at org.elasticsearch.action.ActionListener.run(ActionListener.java:356)
	at org.elasticsearch.cluster.service.MasterService$5.doRun(MasterService.java:1216)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.lang.Thread.run(Thread.java:1570)

Next steps:

  • Determine if this is an issue
    • Is the Transform supposed to be running?
    • Is there a deployment happening at the time?
prwhelan added the >bug, :ml/Transform, Team:ML, and v8.14.0 labels Apr 9, 2024
prwhelan self-assigned this Apr 9, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@prwhelan
Member Author

This may be resolved when we introduce the change to abort failing transforms during cluster restarts: #100891

The only thing I can think of is that we have two threads at work: thread1 is removing the persistent task during a node shutdown, and thread2 is trying to update the transform configuration. At least with #100891, the error won't fail the transform; instead, we'll retry it on another node (or on the same node when it comes back online).

@prwhelan
Member Author

The above was only partially true. There is another set of failures that seems to be followed by a Transform delete:

[endpoint.metadata_united-default-8.14.0] deleted transform

But the Indexer thread is still running and will eventually fail and error out.
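To make the ordering concrete, here is a minimal sketch of the sequence described above: a run is queued, the delete removes the config, and the queued run then fails its lookup. `ConfigStore` and `runScheduledTrigger` are hypothetical stand-ins written for illustration, not Elasticsearch classes, and the real indexer is asynchronous rather than a plain `Runnable`.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in names; none of these are Elasticsearch classes.
final class StaleIndexerRunSketch {

    /** In-memory stand-in for the transform config index. */
    static final class ConfigStore {
        private final Map<String, String> configs = new ConcurrentHashMap<>();

        void put(String id, String config) { configs.put(id, config); }

        void delete(String id) { configs.remove(id); }

        Optional<String> get(String id) { return Optional.ofNullable(configs.get(id)); }
    }

    /** What an already-queued indexer run does once it finally executes. */
    static String runScheduledTrigger(ConfigStore store, String id) {
        return store.get(id)
            .map(config -> "indexed with " + config)
            .orElse("FAILED: Transform with id [" + id + "] could not be found");
    }

    public static void main(String[] args) {
        ConfigStore store = new ConfigStore();
        String id = "example-transform";
        store.put(id, "config-v1");

        // Step 1: the run is queued while the config still exists.
        Runnable queuedRun = () -> System.out.println(runScheduledTrigger(store, id));

        // Step 2: the delete removes the config before the queued run executes.
        store.delete(id);

        // Step 3: the queued run executes and can no longer find its config.
        queuedRun.run();
    }
}
```

Nothing in the queued run knows that a delete happened, so the missing config looks like an unknown state rather than an expected shutdown.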

@prwhelan
Member Author

Seems to come from this: https://github.com/elastic/kibana/blob/main/x-pack/plugins/fleet/server/services/epm/elasticsearch/transform/remove.ts#L27

So likely we're calling the Stop API beforehand, or at least we are calling it as part of the delete API (via force=true).

@przemekwitek
Contributor

> So likely we're calling the Stop API beforehand, or at least we are calling it as part of the delete API (via force=true).

Yes, the backend is calling _stop as part of DELETE.
See the TransportDeleteTransformAction.masterOperation method.

prwhelan added a commit to prwhelan/elasticsearch that referenced this issue Apr 23, 2024
When `_stop?wait_for_checkpoint=false` or
`_stop?force=true&wait_for_checkpoint=false` is called, there is a
small chance that the Transform Indexer thread will still run if it was
scheduled before the stop API was called but the threadpool has not yet
run the executable. The `onStart` method now checks the state of the
indexer before executing. This will mitigate errors caused by reading
from Transform internal indices while the Task is stopped or deleted.

This does not affect the `wait_for_checkpoint=true` case, because the
indexer state will remain `INDEXING` until the checkpoint is finished.

Relate elastic#107266
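
The state check described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `TransformIndexer` code: the `Indexer` class, its `schedule`/`stop` methods, and the `configReads` counter are hypothetical, and the enum only mirrors state names mentioned in this thread plus a few assumed ones.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; the class and method names are not Elasticsearch APIs.
final class OnStartStateCheckSketch {

    // INDEXING and ABORTING appear in this thread; the other states are assumed.
    enum IndexerState { STARTED, INDEXING, STOPPING, STOPPED, ABORTING }

    static final class Indexer {
        private final AtomicReference<IndexerState> state =
            new AtomicReference<>(IndexerState.STARTED);
        private int configReads = 0;

        /** Marks the run as in progress and returns the work handed to the thread pool. */
        Runnable schedule() {
            state.compareAndSet(IndexerState.STARTED, IndexerState.INDEXING);
            return this::onStart;
        }

        /** Stand-in for a stop with wait_for_checkpoint=false: the state flips immediately. */
        void stop() {
            state.set(IndexerState.STOPPED);
        }

        /** Re-checks the state before touching the internal indices. */
        void onStart() {
            if (state.get() != IndexerState.INDEXING) {
                return; // stopped or aborted since scheduling, so skip the config read
            }
            configReads++; // stand-in for reloading the transform config
        }

        int configReads() { return configReads; }
    }

    public static void main(String[] args) {
        // Normal case: nothing happens between scheduling and execution.
        Indexer normal = new Indexer();
        Runnable first = normal.schedule();
        first.run();
        System.out.println("normal run, config reads: " + normal.configReads()); // 1

        // Race case: stop lands after scheduling but before the queued run executes.
        Indexer stopped = new Indexer();
        Runnable queued = stopped.schedule();
        stopped.stop();
        queued.run();
        System.out.println("stopped run, config reads: " + stopped.configReads()); // 0
    }
}
```

The check narrows the window rather than closing it: a stop that lands after the check but before the read would still get through, which is consistent with the commit message saying "mitigate".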
prwhelan added a commit that referenced this issue Apr 25, 2024
When `_stop?wait_for_checkpoint=false` or
`_stop?force=true&wait_for_checkpoint=false` is called, there is a
small chance that the Transform Indexer thread will still run if it was
scheduled before the stop API was called but the threadpool has not yet
run the executable. The `onStart` method now checks the state of the
indexer before executing. This will mitigate errors caused by reading
from Transform internal indices while the Task is stopped or deleted.

This does not affect the `wait_for_checkpoint=true` case, because the
indexer state will remain `INDEXING` until the checkpoint is finished.

Relate #107266
prwhelan added a commit to prwhelan/elasticsearch that referenced this issue Apr 25, 2024
Check if the Transform was aborted before failing due to missing
Transform config.

If the `DELETE _transform/id` API is called while the Indexer
is looking up the Config, it is possible the delete API will remove the
Config before the Indexer can retrieve the Config.  Rather than fail the
Transform, the indexer will check if the delete API has been called via
the `ABORTING` state and move into its graceful shutdown sequence.

Fix elastic#107266
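
As a rough illustration of the decision this commit message describes, the sketch below separates "config missing because a delete is in progress" from "config missing for an unknown reason". The `Outcome` enum and `onConfigReload` method are hypothetical names chosen for this example, and the two-value state enum is a simplification, not the actual Elasticsearch code.

```java
import java.util.Optional;

// Hypothetical sketch; the names below are illustrative, not Elasticsearch APIs.
final class AbortOnMissingConfigSketch {

    // Only the two states relevant to this decision are modelled.
    enum IndexerState { INDEXING, ABORTING }

    enum Outcome { CONTINUE, GRACEFUL_SHUTDOWN, FAIL }

    /** Decides what to do after attempting to reload the transform config. */
    static Outcome onConfigReload(Optional<String> config, IndexerState state) {
        if (config.isPresent()) {
            return Outcome.CONTINUE;
        }
        if (state == IndexerState.ABORTING) {
            // The delete API removed the config first: shut down instead of failing.
            return Outcome.GRACEFUL_SHUTDOWN;
        }
        // Config is gone and nobody asked for a delete: unknown state, fail out.
        return Outcome.FAIL;
    }

    public static void main(String[] args) {
        System.out.println(onConfigReload(Optional.of("config-v1"), IndexerState.INDEXING)); // CONTINUE
        System.out.println(onConfigReload(Optional.empty(), IndexerState.ABORTING));         // GRACEFUL_SHUTDOWN
        System.out.println(onConfigReload(Optional.empty(), IndexerState.INDEXING));         // FAIL
    }
}
```

Keeping the `FAIL` branch preserves the original intent of the check: a config that disappears without a delete request is still treated as an unknown state.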
prwhelan linked a pull request that will close this issue Apr 25, 2024
3 participants