
Failed tasks and action type execution param objects remain as saved objects forever #55340

Closed
mikecote opened this issue Jan 20, 2020 · 17 comments · Fixed by #109655

Comments

@mikecote
Contributor

When a task in task manager fails and will no longer be retried, it gets a status of failed and remains a saved object forever.

When an action type is executed via task manager (e.g. from alerting) and it fails, the same applies to its task in task manager, and also to the associated action_task_params object.
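
A quick way to see these leftovers (a sketch, assuming an 8.x @elastic/elasticsearch client and the default .kibana_task_manager index; the field names match the task document shown later in this thread):

    import { Client } from '@elastic/elasticsearch';

    const client = new Client({ node: 'http://localhost:9200' });

    // Find action tasks stuck in the terminal "failed" status. Task manager never
    // cleans these up, nor the action_task_params objects they point at.
    async function findFailedActionTasks() {
      const result = await client.search({
        index: '.kibana_task_manager*',
        size: 100,
        query: {
          bool: {
            filter: [
              { term: { type: 'task' } },
              { term: { 'task.status': 'failed' } },
              { prefix: { 'task.taskType': 'actions:' } },
            ],
          },
        },
      });
      return result.hits.hits;
    }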

@mikecote mikecote added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jan 20, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Jan 21, 2020

An additional complication for actions is that they can return a retry status, which should cause them to get retried. I don't remember what the logic is to stop doing retries; TM must have some max retry count or something.

Anyhoo, we'll need to figure out how to have TM delete the action_task_params. One thought is that we could use something like (or use directly) SO references - as a list of additional SO's that should be deleted when the task SO is deleted.

Another thought is that - I think when we added action_task_params we did it because we have an API key in there, using an ESO. TM tasks did not use ESOs at the time, but I'm thinking perhaps now they do? Anyhoo, wondering if we can squirrel whatever bits are in the action_task_params into the task SO somehow.
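
To make the first idea concrete, a purely hypothetical sketch (none of these helpers exist today): the task SO carries a reference to the action_task_params SO it owns, and whatever deletes the task also deletes the referenced objects.

    interface SavedObjectReference {
      name: string;
      type: string;
      id: string;
    }

    interface TaskSavedObject {
      id: string;
      // Hypothetical: one reference entry per additional SO that should die with the task,
      // e.g. { name: 'actionTaskParams', type: 'action_task_params', id: '...' }.
      references: SavedObjectReference[];
    }

    // Hypothetical cleanup: removing a task also removes everything it references.
    async function deleteTaskAndOwnedObjects(
      task: TaskSavedObject,
      deleteSavedObject: (type: string, id: string) => Promise<void>
    ): Promise<void> {
      for (const ref of task.references) {
        await deleteSavedObject(ref.type, ref.id);
      }
      await deleteSavedObject('task', task.id);
    }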

@mikecote
Contributor Author

We could add a UI that lets users view and manage these leftover objects, turning this into a feature.

@pmuellr
Member

pmuellr commented Jan 22, 2020

Yes, per Bill's comment in the sync this week: "not deleting these is a feature" :-)

We'd need to build a UI to show them, delete them, etc. Sounds like work ...

@ymao1 ymao1 added the discuss label Apr 23, 2021
@gmmorris
Contributor

Following this PR these Saved Objects no longer remain forever; instead they are cleaned up.
This cleanup task runs every 5m by default, though it backs off to 1h when there is nothing to clean up.
Given this, it's now safe to assume that a failed action won't leave any Saved Objects behind forever, so this issue feels like it can be closed.
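
For context, that scheduling behavior boils down to something like this (an illustrative sketch only, not the actual implementation; the helper names are made up):

    // Illustrative only: clean up on a short interval, back off when idle.
    const DEFAULT_INTERVAL_MS = 5 * 60 * 1000; // 5m default
    const IDLE_BACKOFF_MS = 60 * 60 * 1000;    // 1h when nothing was cleaned up

    async function runCleanupCycle(
      cleanupFailedActionTasks: () => Promise<number> // hypothetical: returns count removed
    ): Promise<{ nextRunInMs: number }> {
      const removed = await cleanupFailedActionTasks();
      return { nextRunInMs: removed === 0 ? IDLE_BACKOFF_MS : DEFAULT_INTERVAL_MS };
    }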

Any objection?

@gmmorris gmmorris added the Feature:Alerting/RulesFramework Issues related to the Alerting Rules Framework label Jul 1, 2021
@gmmorris gmmorris added the loe:large Large Level of Effort label Jul 14, 2021
@gmmorris
Contributor

Following this PR these Saved Objects no longer remain forever; instead they are cleaned up.
This cleanup task runs every 5m by default, though it backs off to 1h when there is nothing to clean up.
Given this, it's now safe to assume that a failed action won't leave any Saved Objects behind forever, so this issue feels like it can be closed.

Any objection?

I spoke to @mikecote and we feel this issue can remain open, as it raises the more fundamental question of why these tasks remain as failed tasks at all rather than being closed at the time.
We should look into a process that no longer necessitates leaving these failed tasks behind (such as an Event Log based record that enables us to handle/recreate them if needed).

@gmmorris gmmorris added loe:needs-research This issue requires some research before it can be worked on or estimated and removed loe:large Large Level of Effort labels Jul 14, 2021
@mikecote
Contributor Author

Adding this issue to the 7.16/8.0 candidates as it would help the core team address some debt early (#106991) if/when this issue is resolved.

@pmuellr
Member

pmuellr commented Aug 10, 2021

We should look into a process that no longer necessitates leaving these failed tasks behind (such as an Event Log based record that enables us to handle/recreate them if needed)

Action failures are already logged in the action/execute event log documents (see below); that doc includes the error message, and also the id, typeId, and namespace (name would be good to add; more generally, we should grow a kibana.connector field that can store all of this as "top-level fields" instead of the current "nested" field type of saved_objects). What it's missing is the actual connector parameters.

if (result.status === 'ok') {
  span?.setOutcome('success');
  event.event.outcome = 'success';
  event.message = `action executed: ${actionLabel}`;
} else if (result.status === 'error') {
  span?.setOutcome('failure');
  event.event.outcome = 'failure';
  event.message = `action execution failure: ${actionLabel}`;
  event.error = event.error || {};
  event.error.message = actionErrorToMessage(result);
  logger.warn(`action execution failure: ${actionLabel}: ${event.error.message}`);
} else {
  span?.setOutcome('failure');
  event.event.outcome = 'failure';
  event.message = `action execution returned unexpected result: ${actionLabel}: "${result.status}"`;
  event.error = event.error || {};
  event.error.message = 'action execution returned unexpected result';
  logger.warn(
    `action execution failure: ${actionLabel}: returned unexpected result "${result.status}"`
  );
}

enables us to handle/recreate them if needed

You mean, "enables a human to notice them and then manually do something" - or code that does this somehow? I could imagine a UI where you could see failed connector executions and then retry them. Is that what you're thinking?

I'm thinking if all we need is the params for the connector - and probably the config, but not the secrets - it would be nice to put these in the event log (for failures only!) in an object enabled: false field (would be some new ECS field), so we could add them, but they wouldn't impact indexing, and hopefully not impact event log size (hoping there aren't a lot of these).
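
For reference, the mapping trick that makes this work: an object field with enabled: false is kept in _source but none of its contents are indexed, so arbitrary connector params can't affect indexing or field counts. The kibana.connector.params field below is an assumption, not an existing event log field.

    // Sketch of an event log mapping addition (hypothetical field name).
    const eventLogMappingAddition = {
      properties: {
        kibana: {
          properties: {
            connector: {
              properties: {
                // Stored with the document, never indexed.
                params: { type: 'object', enabled: false },
              },
            },
          },
        },
      },
    };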

@mikecote
Contributor Author

@gmmorris ^^ regarding the last two paragraphs (in reply to your comment)

@gmmorris
Contributor

gmmorris commented Aug 12, 2021

You mean, "enables a human to notice them and then manually do something" - or code that does this somehow? I could imagine a UI where you could see failed connector executions and then retry them. Is that what you're thinking?

I don't know TBH... but yeah I was thinking in that general direction... perhaps a "retry" button next to each failure that tries to refire the action? 🤷
This is more of a product question IMHO...

@gmmorris gmmorris added migration-resilience Issues related to migration resilience in terms of scale, performance & backwards compatibility estimate:needs-research Estimated as too large and requires research to break down into workable issues labels Aug 16, 2021
@chrisronline chrisronline self-assigned this Aug 18, 2021
@chrisronline
Contributor

I'm thinking if all we need is the params for the connector - and probably the config, but not the secrets - it would be nice to put these in the event log (for failures only!) in an object enabled: false field (would be some new ECS field), so we could add them, but they wouldn't impact indexing, and hopefully not impact event log size (hoping there aren't a lot of these).

Do we need this in the event log if we can just query the saved objects for the same information? Or is there a potential issue accessing those saved objects when something goes wrong?

@chrisronline
Contributor

chrisronline commented Aug 26, 2021

I synced a bit with @mikecote about this issue and I think we have a potential path forward.

Task manager will persist any and all failed tasks, and this behavior is not something we necessarily want to change now. However, we do not want to persist failed action tasks and their action_task_params objects, since actions are designed to run "at most once" and persisting failures is wasted space that eventually has performance implications for migrations.

To avoid needing to change any behavior in task manager directly, we can leverage the fact that task manager will remove all successful non-recurring tasks, and simply pass off all action failures as a "success" to task manager. Of course, we still want to preserve existing behavior as much as possible (which primarily includes error/warning logs to the Kibana server log as well as failed entries in the event log), and this change will do that.

One point of note is retry logic. We currently support retry-able actions, but the action type itself needs to specify this, and none of the built-in action types do this today. This behavior should be preserved as well, and we need to ensure that if the action task has been retried >= the max attempts, we still do not store the failed task forever.

I did a PoC of this in this PR by modifying our existing email action type to support retry.
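
Roughly, the idea looks something like this (a simplified sketch, not the actual PR code; the types and helpers below are stand-ins for the real Kibana ones):

    // Simplified stand-ins for the real types.
    interface ActionExecutorResult {
      status: 'ok' | 'error';
      message?: string;
      retry?: boolean; // the action type opts in to retries
    }

    interface ActionTaskRunContext {
      attempts: number;
      maxAttempts: number;
      executeAction: () => Promise<ActionExecutorResult>;
      logError: (message: string) => void;             // Kibana server log
      writeEventLogFailure: (message: string) => void; // event log entry
    }

    // Returning normally reads as "success" to task manager, so the task and its
    // action_task_params SO get cleaned up. We only throw (triggering a retry)
    // when the action type asked for one and attempts remain.
    async function runActionTask(ctx: ActionTaskRunContext): Promise<void> {
      const result = await ctx.executeAction();
      if (result.status === 'ok') {
        return;
      }

      const message = result.message ?? 'unknown action execution error';
      ctx.logError(`Action failed: ${message}`);
      ctx.writeEventLogFailure(message);

      if (result.retry === true && ctx.attempts < ctx.maxAttempts) {
        throw new Error(message); // task manager schedules a retry
      }
      // Otherwise swallow the failure so nothing is persisted forever.
    }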

Flows

Current

  • Action task is scheduled

  • Action task params SO is created and stored

  • Action fails

  • Kibana server log written
      server    log   [14:03:19.541] [warning][actions][plugins] action execution failure: .email:48d2ddd0-0697-11ec-99f5-2d4d234a49e3: BadEmail: error sending email: connect ETIMEDOUT 45.11.57.36:343
      server    log   [14:03:19.542] [error][plugins][taskManager] Task actions:.email "b56e3610-0697-11ec-99f5-2d4d234a49e3" failed: Error: error sending email
    
  • Failed task and ATP SO persist
      {
        "_index" : ".kibana_task_manager_8.0.0_001",
        "_id" : "task:c5170010-0697-11ec-99f5-2d4d234a49e3",
        "_score" : null,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "actions:.email",
            "retryAt" : null,
            "runAt" : "2021-08-26T18:02:25.809Z",
            "scope" : [
              "actions"
            ],
            "traceparent" : "00-18d31579c4ab9a941bb306be84a6d814-d0ee6a7b26c1a281-01",
            "startedAt" : null,
            "state" : "{}",
            "params" : """{"spaceId":"default","actionTaskParamsId":"c4f3c0a0-0697-11ec-99f5-2d4d234a49e3"}""",
            "ownerId" : null,
            "scheduledAt" : "2021-08-26T18:02:25.809Z",
            "attempts" : 1,
            "status" : "failed"
          },
          "references" : [ ],
          "updated_at" : "2021-08-26T18:03:46.416Z",
          "coreMigrationVersion" : "8.0.0",
          "type" : "task"
        },
        "sort" : [
          1630001026416
        ]
      },
    

Suggested change

  • Action task is scheduled

  • Action task params SO is created and stored

  • Action fails

  • Kibana server log written
      server    log   [14:22:40.109] [warning][actions][plugins] action execution failure: .email:2786d200-069a-11ec-9c14-37a4468aefd3: Bad: error sending email: getaddrinfo ENOTFOUND bad@foobar.com
      server    log   [14:22:40.110] [error][actions][plugins] Action '2786d200-069a-11ec-9c14-37a4468aefd3' failed: error sending email
    
  • Failed task and ATP SO deleted
      {}
    

@pmuellr
Member

pmuellr commented Aug 26, 2021

ya, that ^^^ works for me!

One slight change - add the space and action name to the error message. It's difficult for a user to go from connector id (or rule id) to the actual object in the UI - much easier with the name and space :-)

@gmmorris
Contributor

gmmorris commented Sep 1, 2021

simply pass off all action failures as a "success" to task manager

This would mean that failed actions appear in the Task Manager health stats as successful....
So if Email actions fail 100% of the time they will appear in the health stats as 100% successful.

Can we live with that?
I feel like doing this, while not having a proper Event Log UI, means we will make it even harder for ourselves when diagnosing issues in customer deployments. 😟

@gmmorris
Contributor

gmmorris commented Sep 2, 2021

simply pass off all action failures as a "success" to task manager

This would mean that failed actions appear in the Task Manager health stats as successful....
So if Email actions fail 100% of the time they will appear in the health stats as 100% successful.

Can we live with that?
I feel like doing this, while not having a proper Event Log UI, means we will make it even harder for ourselves when diagnosing issues in customer deployments. 😟

@mikecote @chrisronline and I discussed this synchronously.
Given the timelines for 7.16 and the risk added by changing TM at this point (this is a capacity risk rather than a technical one: working on this means we don't work on other things), we've decided that we can live with the risk I mention above.

We agreed that:

  1. Long term, failed Actions should appear as failed tasks in Task Manager health stats
  2. A new issue will be filed to address the technical debt introduced by this change and we will aim to address this in early 8.x as we're concerned this change could actually make support a little harder.
  3. We'll update the docs to make this clear and encourage users/support to use the Event Log when investigating actions, rather than rely on the TM health stats

Does this summary sound right @chrisronline @mikecote ?

@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@mikecote
Contributor Author

mikecote commented Sep 2, 2021

LGTM 👍

@chrisronline
Contributor

LGTM!
