
Failed tasks and action type execution param objects remain as saved objects forever #55340

Closed
mikecote opened this issue Jan 20, 2020 · 17 comments · Fixed by #109655

Comments

@mikecote
Contributor

When a task in task manager fails and will no longer be retried, it gets a status of failed and remains a saved object forever.

When an action type is executed via task manager (e.g. from alerting) and it fails, the same applies to its task in task manager, and also to the associated action_task_params object.
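
A quick way to see these leftovers (a sketch, assuming an 8.x @elastic/elasticsearch client and the default .kibana_task_manager index; the field names match the task document shown later in this thread):

    import { Client } from '@elastic/elasticsearch';

    const client = new Client({ node: 'http://localhost:9200' });

    // Find action tasks stuck in the terminal "failed" status. Task manager never
    // cleans these up, nor the action_task_params objects they point at.
    async function findFailedActionTasks() {
      const result = await client.search({
        index: '.kibana_task_manager*',
        size: 100,
        query: {
          bool: {
            filter: [
              { term: { type: 'task' } },
              { term: { 'task.status': 'failed' } },
              { prefix: { 'task.taskType': 'actions:' } },
            ],
          },
        },
      });
      return result.hits.hits;
    }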

@mikecote mikecote added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jan 20, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jan 21, 2020

An additional complication for actions is that they can return a retry status, which should cause them to get retried. I don't remember what the logic is to stop doing retries; TM must have some max retry count or something.

Anyhoo, we'll need to figure out how to have TM delete the action_task_params. One thought is that we could use something like (or use directly) SO references - as a list of additional SO's that should be deleted when the task SO is deleted.

Another thought is that - I think when we added action_task_params we did it because we have an API key in there, using an ESO. TM tasks did not use ESOs at the time, but I'm thinking perhaps now they do? Anyhoo, wondering if we can squirrel whatever bits are in the action_task_params into the task SO somehow.
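
To make the first idea concrete, a purely hypothetical sketch (none of these helpers exist today): the task SO carries a reference to the action_task_params SO it owns, and whatever deletes the task also deletes the referenced objects.

    interface SavedObjectReference {
      name: string;
      type: string;
      id: string;
    }

    interface TaskSavedObject {
      id: string;
      // Hypothetical: one reference entry per additional SO that should die with the task,
      // e.g. { name: 'actionTaskParams', type: 'action_task_params', id: '...' }.
      references: SavedObjectReference[];
    }

    // Hypothetical cleanup: removing a task also removes everything it references.
    async function deleteTaskAndOwnedObjects(
      task: TaskSavedObject,
      deleteSavedObject: (type: string, id: string) => Promise<void>
    ): Promise<void> {
      for (const ref of task.references) {
        await deleteSavedObject(ref.type, ref.id);
      }
      await deleteSavedObject('task', task.id);
    }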

@mikecote
Contributor Author

We could add a UI that lets users view and manage these leftover objects, turning this into a feature.

@pmuellr
Member

pmuellr commented Jan 22, 2020

Yes, per Bill's comment in the sync this week: "not deleting these is a feature" :-)

We'd need to build a UI to show them, delete them, etc. Sounds like work ...

@ymao1 ymao1 added the discuss label Apr 23, 2021
@gmmorris
Contributor

Following this PR these Saved Objects no longer remain forever; instead they are cleaned up.
This cleanup task runs every 5m by default, though it backs off to 1h when there is nothing to clean up.
Given this, it's now safe to assume that a failed action won't leave any Saved Objects behind forever, so this issue feels like it can be closed.
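
For context, that scheduling behavior boils down to something like this (an illustrative sketch only, not the actual implementation; the helper names are made up):

    // Illustrative only: clean up on a short interval, back off when idle.
    const DEFAULT_INTERVAL_MS = 5 * 60 * 1000; // 5m default
    const IDLE_BACKOFF_MS = 60 * 60 * 1000;    // 1h when nothing was cleaned up

    async function runCleanupCycle(
      cleanupFailedActionTasks: () => Promise<number> // hypothetical: returns count removed
    ): Promise<{ nextRunInMs: number }> {
      const removed = await cleanupFailedActionTasks();
      return { nextRunInMs: removed === 0 ? IDLE_BACKOFF_MS : DEFAULT_INTERVAL_MS };
    }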

Any objection?

@gmmorris gmmorris added the Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework label Jul 1, 2021
@gmmorris gmmorris added the loe:large Large Level of Effort label Jul 14, 2021
@gmmorris
Contributor

Following this PR these Saved Objects no longer remain forever; instead they are cleaned up.
This cleanup task runs every 5m by default, though it backs off to 1h when there is nothing to clean up.
Given this, it's now safe to assume that a failed action won't leave any Saved Objects behind forever, so this issue feels like it can be closed.

Any objection?

I spoke to @mikecote and we feel this issue can remain open, as it raises the more fundamental question of why these tasks remain as failed tasks at all rather than being closed at the time.
We should look into a process that no longer necessitates leaving these failed tasks behind (such as an Event Log based record that enables us to handle/recreate them if needed).

@gmmorris gmmorris added loe:needs-research This issue requires some research before it can be worked on or estimated and removed loe:large Large Level of Effort labels Jul 14, 2021
@mikecote
Contributor Author

Adding this issue to the 7.16/8.0 candidates as it would help the core team address some debt early (#106991) if/when this issue is resolved.

@pmuellr
Member

pmuellr commented Aug 10, 2021

We should look into a process that no longer necessitates leaving these failed tasks behind (such as an Event Log based record that enables us to handle/recreate them if needed)

Action failures are already logged in the action/execute event log documents (see below); that doc includes the error message, and also the id, typeId, and namespace (name would be good to add; more generally, we should grow a kibana.connector field that can store all of this as "top-level fields" instead of the current "nested" field type of saved_objects). What it's missing is the actual connector parameters.

if (result.status === 'ok') {
  span?.setOutcome('success');
  event.event.outcome = 'success';
  event.message = `action executed: ${actionLabel}`;
} else if (result.status === 'error') {
  span?.setOutcome('failure');
  event.event.outcome = 'failure';
  event.message = `action execution failure: ${actionLabel}`;
  event.error = event.error || {};
  event.error.message = actionErrorToMessage(result);
  logger.warn(`action execution failure: ${actionLabel}: ${event.error.message}`);
} else {
  span?.setOutcome('failure');
  event.event.outcome = 'failure';
  event.message = `action execution returned unexpected result: ${actionLabel}: "${result.status}"`;
  event.error = event.error || {};
  event.error.message = 'action execution returned unexpected result';
  logger.warn(
    `action execution failure: ${actionLabel}: returned unexpected result "${result.status}"`
  );
}

enables us to handle/recreate them if needed

You mean, "enables a human to notice them and then manually do something" - or code that does this somehow? I could imagine a UI where you could see failed connector executions and then retry them. Is that what you're thinking?

I'm thinking if all we need is the params for the connector - and probably the config, but not the secrets - it would be nice to put these in the event log (for failures only!) in an object enabled: false field (would be some new ECS field), so we could add them, but they wouldn't impact indexing, and hopefully not impact event log size (hoping there aren't a lot of these).
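
For reference, the mapping trick that makes this work: an object field with enabled: false is kept in _source but none of its contents are indexed, so arbitrary connector params can't affect indexing or field counts. The kibana.connector.params field below is an assumption, not an existing event log field.

    // Sketch of an event log mapping addition (hypothetical field name).
    const eventLogMappingAddition = {
      properties: {
        kibana: {
          properties: {
            connector: {
              properties: {
                // Stored with the document, never indexed.
                params: { type: 'object', enabled: false },
              },
            },
          },
        },
      },
    };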

@mikecote
Contributor Author

@gmmorris ^^ regarding the last two paragraphs (in reply to your comment)

@gmmorris
Contributor

gmmorris commented Aug 12, 2021

You mean, "enables a human to notice them and then manually do something" - or code that does this somehow? I could imagine a UI where you could see failed connector executions and then retry them. Is that what you're thinking?

I don't know TBH... but yeah I was thinking in that general direction... perhaps a "retry" button next to each failure that tries to refire the action? 🤷
This is more of a product question IMHO...

@gmmorris gmmorris added migration-resilience Issues related to migration resilience in terms of scale, performance & backwards compatibility estimate:needs-research Estimated as too large and requires research to break down into workable issues labels Aug 16, 2021
@chrisronline chrisronline self-assigned this Aug 18, 2021
@chrisronline
Contributor

I'm thinking if all we need is the params for the connector - and probably the config, but not the secrets - it would be nice to put these in the event log (for failures only!) in an object enabled: false field (would be some new ECS field), so we could add them, but they wouldn't impact indexing, and hopefully not impact event log size (hoping there aren't a lot of these).

Do we need this in the event log if we can just query the saved objects for the same information? Or is there a potential issue accessing those saved objects when something goes wrong?

@chrisronline
Contributor

chrisronline commented Aug 26, 2021

I synced a bit with @mikecote about this issue and I think we have a potential path forward.

Task manager will persist any and all failed tasks, and this behavior is not something we necessarily want to change now. However, we do not want to persist failed action tasks and their action_task_params objects, since actions are designed to run "at most once" and persisting failures is wasted space that eventually has performance implications for migrations.

To avoid needing to change any behavior in task manager directly, we can leverage the fact that task manager will remove all successful non-recurring tasks, and simply pass off all action failures as a "success" to task manager. Of course, we still want to preserve existing behavior as much as possible (which primarily includes error/warning logs to the Kibana server log as well as failed entries in the event log), and this change will do that.

One point of note is retry logic. We currently support retry-able actions, but the action type itself needs to specify this, and none of the built-in action types do this today. This behavior should be preserved as well, and we need to ensure that if the action task has been retried >= the max attempts, we still do not store the failed task forever.

I did a PoC of this in this PR by modifying our existing email action type to support retry.
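
Roughly, the idea looks something like this (a simplified sketch, not the actual PR code; the types and helpers below are stand-ins for the real Kibana ones):

    // Simplified stand-ins for the real types.
    interface ActionExecutorResult {
      status: 'ok' | 'error';
      message?: string;
      retry?: boolean; // the action type opts in to retries
    }

    interface ActionTaskRunContext {
      attempts: number;
      maxAttempts: number;
      executeAction: () => Promise<ActionExecutorResult>;
      logError: (message: string) => void;             // Kibana server log
      writeEventLogFailure: (message: string) => void; // event log entry
    }

    // Returning normally reads as "success" to task manager, so the task and its
    // action_task_params SO get cleaned up. We only throw (triggering a retry)
    // when the action type asked for one and attempts remain.
    async function runActionTask(ctx: ActionTaskRunContext): Promise<void> {
      const result = await ctx.executeAction();
      if (result.status === 'ok') {
        return;
      }

      const message = result.message ?? 'unknown action execution error';
      ctx.logError(`Action failed: ${message}`);
      ctx.writeEventLogFailure(message);

      if (result.retry === true && ctx.attempts < ctx.maxAttempts) {
        throw new Error(message); // task manager schedules a retry
      }
      // Otherwise swallow the failure so nothing is persisted forever.
    }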

Flows

Current

  • Action task is scheduled

  • Action task params SO is created and stored

  • Action fails

  • Kibana server log written
      server    log   [14:03:19.541] [warning][actions][plugins] action execution failure: .email:48d2ddd0-0697-11ec-99f5-2d4d234a49e3: BadEmail: error sending email: connect ETIMEDOUT 45.11.57.36:343
      server    log   [14:03:19.542] [error][plugins][taskManager] Task actions:.email "b56e3610-0697-11ec-99f5-2d4d234a49e3" failed: Error: error sending email
    
  • Failed task and ATP SO persist
      {
        "_index" : ".kibana_task_manager_8.0.0_001",
        "_id" : "task:c5170010-0697-11ec-99f5-2d4d234a49e3",
        "_score" : null,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "actions:.email",
            "retryAt" : null,
            "runAt" : "2021-08-26T18:02:25.809Z",
            "scope" : [
              "actions"
            ],
            "traceparent" : "00-18d31579c4ab9a941bb306be84a6d814-d0ee6a7b26c1a281-01",
            "startedAt" : null,
            "state" : "{}",
            "params" : """{"spaceId":"default","actionTaskParamsId":"c4f3c0a0-0697-11ec-99f5-2d4d234a49e3"}""",
            "ownerId" : null,
            "scheduledAt" : "2021-08-26T18:02:25.809Z",
            "attempts" : 1,
            "status" : "failed"
          },
          "references" : [ ],
          "updated_at" : "2021-08-26T18:03:46.416Z",
          "coreMigrationVersion" : "8.0.0",
          "type" : "task"
        },
        "sort" : [
          1630001026416
        ]
      },
    

Suggested change

  • Action task is scheduled

  • Action task params SO is created and stored

  • Action fails

  • Kibana server log written
      server    log   [14:22:40.109] [warning][actions][plugins] action execution failure: .email:2786d200-069a-11ec-9c14-37a4468aefd3: Bad: error sending email: getaddrinfo ENOTFOUND bad@foobar.com
      server    log   [14:22:40.110] [error][actions][plugins] Action '2786d200-069a-11ec-9c14-37a4468aefd3' failed: error sending email
    
  • Failed task and ATP SO deleted
      {}
    

@pmuellr
Member

pmuellr commented Aug 26, 2021

ya, that ^^^ works for me!

One slight change - add the space and action name to the error message. It's difficult for a user to go from connector id (or rule id) to the actual object in the UI - much easier with the name and space :-)

@gmmorris
Contributor

gmmorris commented Sep 1, 2021

simply pass off all action failures as a "success" to task manager

This would mean that failed actions appear in the Task Manager health stats as successful....
So if Email actions fail 100% of the time they will appear in the health stats as 100% successful.

Can we live with that?
I feel like doing this, while not having a proper Event Log UI, means we will make it even harder for ourselves when diagnosing issues in customer deployments. 😟

@gmmorris
Contributor

gmmorris commented Sep 2, 2021

simply pass off all action failures as a "success" to task manager

This would mean that failed actions appear in the Task Manager health stats as successful....
So if Email actions fail 100% of the time they will appear in the health stats as 100% successful.

Can we live with that?
I feel like doing this, while not having a proper Event Log UI, means we will make it even harder for ourselves when diagnosing issues in customer deployments. 😟

@mikecote @chrisronline and I discussed this synchronously.
Given the timelines for 7.16 and the risk added by changing TM at this point (this is a capacity risk rather than a technical one: working on this means we don't work on other things), we've decided that we can live with the risk I mention above.

We agreed that:

  1. Long term, failed Actions should appear as failed tasks in Task Manager health stats
  2. A new issue will be filed to address the technical debt introduced by this change and we will aim to address this in early 8.x as we're concerned this change could actually make support a little harder.
  3. We'll update the docs to make this clear and encourage users/support to use the Event Log when investigating actions, rather than rely on the TM health stats

Does this summary sound right @chrisronline @mikecote ?

@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@mikecote
Contributor Author

mikecote commented Sep 2, 2021

LGTM 👍

@chrisronline
Contributor

LGTM!
