
[Discuss][Actions] Should actions remain "at most once"? #102888

Closed

gmmorris opened this issue Jun 22, 2021 · 4 comments
Labels
discuss, estimate:needs-research, Feature:Actions/Framework, Feature:Actions, Feature:Alerting, impact:high, resilience, Team:ResponseOps

Comments

gmmorris (Contributor) commented Jun 22, 2021

When the actions plugin was introduced we chose to go with an "at most once" approach, meaning we wanted to ensure actions don't accidentally fire more than once.

We achieved this by setting the action tasks' maxAttempts to 1, which means actions are never retried.
The thinking at the time was that running an action twice was worse than not running it at all: we had a history of failed tasks in the .kibana_task_manager index, and we had concrete plans to build an Event Log UI before GA.
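For illustration, this is roughly how a task type opts into that behaviour via Task Manager's `registerTaskDefinitions`. This is a simplified sketch, not the actual actions plugin code: the task type name is made up, and `taskManager` (the Task Manager setup contract) and `executeAction` are assumed to be in scope.

```ts
// Simplified sketch of a task definition with "at most once" semantics.
taskManager.registerTaskDefinitions({
  'actions:example-connector': {
    title: 'Example action execution task',
    // maxAttempts: 1 means Task Manager never reschedules the task after a
    // failure, so a failed action execution is simply dropped ("at most once").
    maxAttempts: 1,
    createTaskRunner: ({ taskInstance }) => ({
      async run() {
        // Execute the action with the params stored on the task instance.
        // If this throws, there is no retry.
        // (executeAction is a placeholder for the real execution logic.)
        await executeAction(taskInstance.params);
      },
    }),
  },
});
```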

Best laid plans of 🐭 and 👷‍♀️ and all that: we're long past GA without an Event Log UI, and following this PR these failed tasks are now cleaned up every 5 minutes.

This means that even though actions are still "at most once", we no longer have the mitigation of users being able to investigate failed tasks. Failed actions are still logged in the Event Log, so it's not that there is no record at all, but we have no concrete plans for an Event Log UI, and the tasks themselves are deleted, so it isn't easy to reproduce the failure case.

This feels like a risk to me. I think we should either reconsider the "at most once" guidance, or consider keeping some kind of record of the failed tasks that can't cause migration failures further down the line.

Discuss :)

cc @elastic/kibana-alerting-services @stacey-gammon @kobelb

@gmmorris gmmorris added the discuss, Feature:Alerting, Feature:Actions, and Team:ResponseOps labels Jun 22, 2021
@gmmorris gmmorris added the Project:AlertingNotifyEfficiently and Feature:Actions/Framework labels and removed the Project:AlertingNotifyEfficiently label Jun 30, 2021
mikecote (Contributor) commented:

cc @arisonl for awareness of what we should do from a product perspective.

@mikecote mikecote added this to Backlog in Kibana Alerting Jul 28, 2021
@gmmorris gmmorris added the loe:needs-research label Aug 5, 2021
kobelb (Contributor) commented Aug 9, 2021

I'm largely putting on a product hat here, but I think we should allow our users to choose between "at most once" and "at least once" delivery for actions on a per-action basis. Depending on the situation, I anticipate users wanting the ability to customize how this works for each different alerting rule.
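For the sake of discussion, here is one hypothetical shape a per-action delivery setting could take. None of these fields or types exist today; the names are made up purely for illustration.

```ts
// Hypothetical, for discussion only: a per-action delivery setting.
type ActionDelivery = 'at-most-once' | 'at-least-once';

interface ActionDeliveryOptions {
  delivery: ActionDelivery;
  // Only meaningful for 'at-least-once': how many attempts Task Manager may make.
  maxAttempts?: number;
}

// A rule action that opts into retries, e.g. for a critical pager connector:
const deliveryOptions: ActionDeliveryOptions = {
  delivery: 'at-least-once',
  maxAttempts: 3,
};
```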

@gmmorris gmmorris added the resilience and estimate:needs-research labels Aug 16, 2021
@gmmorris gmmorris removed the loe:needs-research label Sep 2, 2021
@gmmorris gmmorris added the impact:high label Sep 16, 2021
gmmorris (Contributor, Author) commented Oct 6, 2021

This relates to the long running actions issue as well: #113424

@XavierM XavierM removed this from Backlog in Kibana Alerting Jan 6, 2022
@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022
mikecote (Contributor) commented:

Closing in favour of #143046
