Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Actions] Treat failures as successes for Task Manager #109655

Merged
merged 23 commits into from
Sep 9, 2021

Conversation

chrisronline
Copy link
Contributor

@chrisronline chrisronline commented Aug 23, 2021

Relates to #55340

This PR changes how action execution failures are handled for Task Manager. Prior to this PR, if action execution threw an exception or returned an error, Task Manager would attempt to retry the task until the maxAttempts for the task definition was reached. If unable to retry, the task would be marked as Failed and stored forever as such.

Persisting these forever is problematic, especially during upgrades when migrations are ran across every saved object. The amount of failure action task documents can grow exponentially based on the rule and number of actions assigned (especially considering that some rules allow "group by" where a single rule execution can spawn many action tasks). In addition to this, action_task_params saved objects are created alongside action task saved objects and are not removed if actions fail.

In #96971, we added a task that will clean up these failed action tasks, but we should really stop persisting these if we plan to just clean them up later anyways.

This PR aims to do that by telling Task Manager that the action was successful so it will remove it.

This does solve the problem as described above, but it also introduces new challenges, mainly that the task manager health api will report all failed actions as successful now which we will need to note in our documentation.

Testing

You need to create a scenario where actions are failing and ensure that failed actions aren't persisted in TM:

POST .kibana_task_manager*/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "task.status",
        "size": 100
      }
    }
  }
}

@chrisronline chrisronline changed the title Only throw exceptions for failed tasks if retries are supported [Actions] Treat failures as successes for Task Manager Sep 1, 2021
@gmmorris
Copy link
Contributor

gmmorris commented Sep 2, 2021

This does solve the problem as described above, but it also introduces new challenges, mainly that the task manager health api will report all failed actions as successful now which we will need to note in our documentation.

To preempt a question which will likely come up in review:
As mentioned here we will address this in the future by:

  1. Rolling this change back, as the ideal behaviour is for failed actions to appear as failed tasks
  2. Extend Task Manager to allow tasks to fail in such a manner that Task Manager cleans up the task document despite having failed, thereby avoiding the migration issue.

@chrisronline chrisronline marked this pull request as ready for review September 2, 2021 13:13
@chrisronline chrisronline requested a review from a team as a code owner September 2, 2021 13:13
@chrisronline chrisronline added release_note:skip Skip the PR/issue when compiling release notes review Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.16.0 v8.0.0 labels Sep 2, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@chrisronline chrisronline added the Feature:Actions/Framework Issues related to the Actions Framework label Sep 2, 2021
@chrisronline
Copy link
Contributor Author

@elasticmachine merge upstream

Copy link
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Verified that failed actions no longer leave task manager documents or action_task_params documents. Just a few comments about documentation and log messages. Nice work!

@@ -123,6 +124,14 @@ The root `status` indicates the `status` of the system overall.

The Runtime `status` indicates whether task executions have exceeded any of the <<task-manager-configuring-health-monitoring,configured health thresholds>>. An `OK` status means none of the threshold have been exceeded. A `Warning` status means that at least one warning threshold has been exceeded. An `Error` status means that at least one error threshold has been exceeded.

[IMPORTANT]
==============================================
Some tasks (such as <<action-types,connectors>>) will incorrectly report their status as successful even if the task failed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this part should be more generic since task manager can run any type of task, and we should add a section to the Actions and Connectors docs that specifically reference the event log.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you thinking it should say? actions are the only known culprit of this and it does say Some tasks to make it seem like it's not an only action thing. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering if it might lead people reading these docs to try to use the event log to look up failures for other tasks.

@gchaps Any suggestions for wording?

Copy link
Contributor

@gchaps gchaps Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need more context to provide wording.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry, I missed this. I saw you added feedback. Did you still need more context?

x-pack/plugins/actions/server/lib/task_runner_factory.ts Outdated Show resolved Hide resolved
docs/management/action-types.asciidoc Outdated Show resolved Hide resolved
docs/management/action-types.asciidoc Outdated Show resolved Hide resolved
docs/management/action-types.asciidoc Outdated Show resolved Hide resolved
docs/management/action-types.asciidoc Outdated Show resolved Hide resolved
docs/management/action-types.asciidoc Outdated Show resolved Hide resolved
chrisronline and others added 6 commits September 7, 2021 13:50
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Copy link
Contributor

@YulNaumenko YulNaumenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@gchaps gchaps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@chrisronline
Copy link
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id before after diff
actions 122 123 +1

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id before after diff
actions 7 8 +1
Unknown metric groups

API count

id before after diff
actions 122 123 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@chrisronline chrisronline merged commit b9e6f93 into elastic:master Sep 9, 2021
@chrisronline chrisronline deleted the actions/failed_tasks branch September 9, 2021 16:51
chrisronline added a commit to chrisronline/kibana that referenced this pull request Sep 9, 2021
* Support retry with email as an example

* Fix tests

* Add logic to treat as failure if there is a retry

* Handle retry better

* Make this optional

* Tweaks

* Remove unnecessary code

* Fix existing tests

* Add some unit tests

* Add test

* Add doc note

* More docs

* PR feedback

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
chrisronline added a commit that referenced this pull request Sep 9, 2021
…1769)

* Support retry with email as an example

* Fix tests

* Add logic to treat as failure if there is a retry

* Handle retry better

* Make this optional

* Tweaks

* Remove unnecessary code

* Fix existing tests

* Add some unit tests

* Add test

* Add doc note

* More docs

* PR feedback

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

* Update docs/management/action-types.asciidoc

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Actions/Framework Issues related to the Actions Framework release_note:skip Skip the PR/issue when compiling release notes review Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.16.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failed tasks and action type execution param objects remain as saved objects forever
7 participants