[Discuss][Actions] Should actions remain "at most once"? #102888
Labels: discuss, estimate:needs-research, Feature:Actions/Framework, Feature:Actions, Feature:Alerting, impact:high, resilience, Team:ResponseOps
When the actions plugin was introduced we chose to go with an "at most once" approach, meaning we wanted to ensure actions don't accidentally fire more than once.
We achieved this by setting `maxAttempts` to `1` on action tasks, which means actions are never retried.

The thinking at the time was that running an action twice was worse than not running it at all, as we had a history of failed tasks in the `.kibana_task_manager` index and we had concrete plans of building an Event Log UI before GA.

Best laid plans of 🐭 and 👷‍♀️ and all, we're long past GA without an Event Log UI, and following this PR these failed tasks are cleaned up every `5m`.

This means that even though actions are still "at most once", we no longer have the mitigation of users being able to investigate failed tasks. Failed actions are still logged in the Event Log, so it's not that there is no record, but we have no concrete plans for an Event Log UI, and the tasks themselves are deleted, so it isn't easy to reproduce the failure case.
This feels like a risk to me, and I think we should either reconsider the "at most once" guidance, or we should consider some kind of record of the failed tasks that can't cause migration failures further down the line.
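
To make the trade-off concrete, here is a minimal sketch of what "at most once" buys and costs. It is not the Task Manager API: `runWithAttempts` and `makeFlakyAction` are hypothetical helpers invented for illustration, and only the `maxAttempts` name comes from the description above.

```typescript
// Hypothetical sketch, NOT the Task Manager API.
type RunResult = { attempts: number; succeeded: boolean };

// Runs `action` up to `maxAttempts` times and stops at the first success.
function runWithAttempts(action: () => void, maxAttempts: number): RunResult {
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts++;
    try {
      action();
      return { attempts, succeeded: true };
    } catch {
      // Swallow the error and retry if attempts remain.
    }
  }
  return { attempts, succeeded: false };
}

// Builds an action whose first call fails either before or after its
// side effect (e.g. sending an email); later calls succeed.
function makeFlakyAction(failBeforeSend: boolean) {
  let calls = 0;
  let delivered = 0;
  const action = (): void => {
    calls++;
    if (calls === 1 && failBeforeSend) {
      throw new Error("connection refused");
    }
    delivered++; // the externally visible side effect
    if (calls === 1) {
      throw new Error("timed out waiting for response");
    }
  };
  return { action, delivered: () => delivered };
}

// Case 1: failure before the side effect.
const lostOnce = makeFlakyAction(true);
const lostOnceResult = runWithAttempts(lostOnce.action, 1); // never delivered
const recovered = makeFlakyAction(true);
const recoveredResult = runWithAttempts(recovered.action, 3); // delivered once

// Case 2: failure after the side effect.
const singleSend = makeFlakyAction(false);
const singleSendResult = runWithAttempts(singleSend.action, 1); // delivered once, reported failed
const doubleSend = makeFlakyAction(false);
const doubleSendResult = runWithAttempts(doubleSend.action, 3); // delivered twice

console.log(lostOnce.delivered(), recovered.delivered());
console.log(singleSend.delivered(), doubleSend.delivered());
```

With `maxAttempts` at `1` the action is never duplicated, but a failure before the side effect is lost unless something keeps a record of it. Allowing retries recovers that case at the price of a possible duplicate when the failure happens after the side effect.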
Discuss :)
cc @elastic/kibana-alerting-services @stacey-gammon @kobelb