
The execution of actions is often delayed after a rule schedules them due to capacity and persistence overhead #90888

Closed
mikecote opened this issue Feb 10, 2021 · 7 comments · Fixed by #97311
Assignees
Labels
Feature:Actions · Feature:Alerting/RuleActions · Feature:Alerting/RulesFramework · Feature:Alerting · Team:ResponseOps

Comments

@mikecote
Contributor

mikecote commented Feb 10, 2021

There is a lot of overhead in running alert actions, each in their own task. It can also clog up the task manager queue when many alert actions need to be scheduled, and large payloads can be hard to serialize. We should look at a way to eliminate the actions plugin's use of tasks and run the actions in the same execution as the alert. This will allow a lot of code cleanup (action_task_params, task runner, etc.) and help with overall performance.

Since the retry logic doesn't work at this time (#79169), we could adopt an "at most once" model for now and worry about retry logic as requirements arise.
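
As an illustration only, here is a rough TypeScript sketch of the two models being discussed; the `TaskSchedulerLike` and `ActionsClientLike` interfaces are hypothetical stand-ins, not the real task manager or actions client APIs:

```ts
// Hypothetical shapes, for illustration only - the real Kibana APIs differ.
interface ActionToRun {
  actionId: string;
  params: Record<string, unknown>;
}

interface TaskSchedulerLike {
  schedule(task: { taskType: string; params: Record<string, unknown> }): Promise<void>;
}

interface ActionsClientLike {
  execute(action: ActionToRun): Promise<{ status: 'ok' | 'error'; message?: string }>;
}

// Today (roughly): every action becomes its own persisted task, paying the
// serialization and task manager capacity cost for each one.
async function scheduleActionsAsTasks(scheduler: TaskSchedulerLike, actions: ActionToRun[]) {
  for (const action of actions) {
    await scheduler.schedule({ taskType: 'actions:execute', params: action });
  }
}

// Proposed direction: execute the actions inside the rule's own task run,
// skipping the persistence round-trip entirely ("at most once" - no retry).
async function executeActionsInline(actionsClient: ActionsClientLike, actions: ActionToRun[]) {
  for (const action of actions) {
    const result = await actionsClient.execute(action);
    if (result.status === 'error') {
      console.warn(`action ${action.actionId} failed: ${result.message}`);
    }
  }
}
```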

@mikecote mikecote added the Feature:Alerting, Feature:Actions, and Team:ResponseOps labels Feb 10, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote
Contributor Author

I think this issue would be a good one to work on in ~7.14.

@pmuellr
Member

pmuellr commented Feb 10, 2021

This will provide the opportunity to more easily "pipeline" actions - run a Jira action to create an issue, where the issue number/link is returned in the action result and can then - somehow - be passed to subsequent actions, like an email/Slack action that would include a link to the Jira issue. There is still lots of design and code to get there, but running actions in the same task as the alert makes this a lot easier than the current architecture.
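
For illustration, a minimal sketch of what such pipelining could look like, assuming a hypothetical `ActionRunner` signature (none of these names are real Kibana APIs):

```ts
// Hypothetical result-passing between actions; all names are illustrative.
interface ActionResult {
  status: 'ok' | 'error';
  data?: Record<string, unknown>; // e.g. { issueKey: 'PROJ-123', issueUrl: '...' }
}

type ActionRunner = (params: Record<string, unknown>) => Promise<ActionResult>;

// Run actions in order, exposing every earlier result to the next action's params.
async function runPipeline(
  steps: Array<{ name: string; run: ActionRunner; params: Record<string, unknown> }>
) {
  const previousResults: Record<string, ActionResult> = {};
  for (const step of steps) {
    const result = await step.run({ ...step.params, previousResults });
    previousResults[step.name] = result;
    if (result.status === 'error') break; // stop the chain on failure
  }
  return previousResults;
}

// Hypothetical usage: a Jira "create issue" step followed by an email that links to it.
// runPipeline([
//   { name: 'jira', run: createJiraIssue, params: { summary: '...' } },
//   { name: 'email', run: sendEmail, params: { body: 'See {{previousResults.jira.data.issueUrl}}' } },
// ]);
```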

It's also interesting because today actions only have access to the "context variables" the alert provided, which are serialized JSON. If we're running in the same flow as alerts, there's a possibility we could make more interesting alert data available - presumably via function access - as a future enhancement.

Likewise, on the "output" end of actions, we should be able to capture more error diagnostic information that the alert could know about. Today that's difficult because the alert just queues the actions to run and then never gets their results.

There is likely a lot of optimization we can make when we have to run the same actions for multiple instances - security checks and whatnot - probably as a follow-on PR to the initial work of just calling the actions via the actions client.

@gmmorris
Contributor

One thing to consider in this regard is that the next run of the Alert won't happen until all actions have executed - I'm not sure that's the behavior we want.

Perhaps the approach should be to group all actions in a single task (rather than a task per action, as we have now), but keep that task separate from the Alert task?
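
For illustration, a sketch of that grouping idea under hypothetical names (the `TaskSchedulerLike` interface is a stand-in, not the real task manager API):

```ts
// Hypothetical: enqueue one "run all actions for this rule run" task instead
// of a task per action, keeping it separate from the rule's own task.
interface ScheduledAction {
  actionId: string;
  params: Record<string, unknown>;
}

interface TaskSchedulerLike {
  schedule(task: { taskType: string; params: Record<string, unknown> }): Promise<void>;
}

// The rule task finishes as soon as this single task is enqueued, so the next
// rule run is not blocked while the actions execute.
async function scheduleGroupedActionsTask(
  scheduler: TaskSchedulerLike,
  ruleId: string,
  actions: ScheduledAction[]
) {
  await scheduler.schedule({
    taskType: 'alerting:run-rule-actions', // illustrative task type name
    params: { ruleId, actions },
  });
}
```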

@mikecote mikecote added this to 7.14 - Tentative in Kibana Alerting Feb 16, 2021
@mikecote mikecote moved this from 7.14 - Tentative to To-Do (Ordered by priority) in Kibana Alerting Mar 9, 2021
@chrisronline chrisronline self-assigned this Apr 5, 2021
@gmmorris gmmorris moved this from To-Do (Ordered by priority) to In Progress in Kibana Alerting Apr 7, 2021
@chrisronline
Contributor

Can we elaborate on the various scenarios we should ensure still work properly with any change to this behavior?

For example, if we decide to handle action execution within the same task, it might mean we lose out on multi-Kibana alert execution (as in, the single Kibana that picked up the task must execute all actions) - is that something we can't change?

Also, AFAIK, we do not have any sort of retry logic in place if the action task fails, and I'm assuming we don't need to change this behavior, but do we need to leave the door open in the future for retries? Or is that not something we've committed to doing?

@gmmorris
Contributor

gmmorris commented Apr 7, 2021

Can we elaborate on the various scenarios we should ensure still work properly with any change to this behavior?

There's probably a bunch, but here's what crosses my mind at this moment in time:

  1. Long-running actions shouldn't cause the Rule to execute any later than they would have if offloaded to another task. But I don't love the idea of just spinning off some code in setTimeout(0) 😱, so figuring out how to do this safely, with visibility into what happens and what's running, is quite important. We can perhaps do something incremental where we do the work manually async now, but in the future utilise a Task Manager Worker to do work without persisting it into the queue?

  2. I think we currently spin off a task for each Alert ("Alert Instance" in the code). That means that if the action fails for one alert, we still get the action executing for another Alert. We probably want to make sure that behaviour is retained.

  3. If I'm right about the task-per-Alert thing, then this also means we get some degree of concurrency at the moment, as tasks can be parallelised across Kibana instances (and across workers in the same Kibana [as much as node.js can parallelise... you know what I mean]). Do we now turn it into a queue executed one by one? Or do we try to utilise some concurrency (see the sketch after this list)? In terms of "things to look out for", I guess we need to look out for how this behaviour differs from what we have now. We can then discuss how it should be. Just because we did it one way before, doesn't mean it should stay the same. :)
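
For illustration, a sketch of one way to keep that per-Alert independence while capping concurrency, using plain `Promise.allSettled` (the function names and the limit are assumptions, not agreed behaviour):

```ts
// Hypothetical per-alert action runner: one alert's failing actions are
// recorded but do not stop the actions for the other alerts.
type RunActionsForAlert = (alertId: string) => Promise<void>;

async function runActionsForAlerts(
  alertIds: string[],
  runActionsForAlert: RunActionsForAlert,
  maxConcurrent = 10 // illustrative limit, not a real Kibana setting
) {
  const results: Array<{ alertId: string; error?: unknown }> = [];
  for (let i = 0; i < alertIds.length; i += maxConcurrent) {
    const batch = alertIds.slice(i, i + maxConcurrent);
    // allSettled keeps alerts independent: a rejection for one alert does
    // not reject the whole batch.
    const settled = await Promise.allSettled(batch.map((id) => runActionsForAlert(id)));
    settled.forEach((outcome, idx) => {
      results.push(
        outcome.status === 'rejected'
          ? { alertId: batch[idx], error: outcome.reason }
          : { alertId: batch[idx] }
      );
    });
  }
  return results;
}
```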

For example, if we decide to handle action execution within the same task, it might mean we lose out on multi-Kibana alert execution (as in, the single Kibana that picked up the task must execute all actions) - is that something we can't change?

Sigh, this is tricky.
We lose the ability to share the load between Kibana instances... which I think is OK as a starting point. But we will likely have to revisit this... perhaps we do batch up the work for later if there are more than some threshold of Alerts? Or, alternatively, we enable Task Manager to execute tasks without persisting them, so the work is always handled by one Kibana, but at least it ensures we don't swamp that Kibana with more Actions and Rules than it can handle, and we work our way through the queue of Actions that way?

Also, AFAIK, we do not have any sort of retry logic in place if the action task fails, and I'm assuming we don't need to change this behavior, but do we need to leave the door open in the future for retries? Or is that not something we've committed to doing?

Agreed, we don't need to introduce new retry logic, but we might want to in the future.
Note my 2nd point above - we don't have retry, but each Alert is handled by its own task, so one failing Alert doesn't cause the others to fail. We would want to retain that independence between Alerts, I think.

One general note:
It would be good to explore some options, including incremental steps that can get us there.
It's fine if this takes several PRs to achieve (even across minors), as long as we have a clear goal in mind and can articulate the value it provides.

@pmuellr
Member

pmuellr commented Apr 7, 2021

Bunch of rando thoughts below, but I feel like there are a bunch of unknown unknowns, so I think hacking at it to get it to basically work is the way to start, and then see what happens. Our stress tests invoke actions, right? I think maybe just server log, though. That will be good enough to start with - make sure even a basic implementation isn't breaking something.

... rando thoughts ...

Yes, let's not spin off tasks with setTimeout(0) if we can avoid it, since they aren't terribly observable. If we can find places to do parallel work, we can invoke via an async function and collect the promise for later waiting. And I guess we should make use of the new AbortSignal stuff, so we could "in theory" cancel a running task.
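
For illustration, a small sketch of the "collect the promises and cancel via AbortSignal" idea; the work-function shape and the timeout are assumptions:

```ts
// Kick off the work up front, keep the promises so the caller can await them
// later (rather than fire-and-forget via setTimeout(0)), and wire up an
// AbortSignal so a long-running batch can "in theory" be cancelled.
async function runCancellableActions(
  work: Array<(signal: AbortSignal) => Promise<void>>,
  timeoutMs = 60_000 // illustrative timeout
) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  const pending = work.map((fn) => fn(controller.signal));
  try {
    // allSettled so one failing action doesn't hide the results of the others.
    return await Promise.allSettled(pending);
  } finally {
    clearTimeout(timer);
  }
}
```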

I think we currently spin off a task for each Alert ("Alert Instance" in the code).

My memory is that we spin off a task for each action - so if you have 10 alert instances that need to call 3 actions, that's 30 tasks. And we basically just queue them all to run independently, at the same time.

Longer-term, it would be nice to serialize those 3 actions, so the 2nd isn't invoked till the 1st is done, and the 3rd isn't invoked till the 2nd is done. Because we'll want to be able to feed the output from previous actions as input to the subsequent actions. That's all we know about that - not clear HOW we pass that data! But it does imply there's some ordering, and it would be nice to see if we could make that possible now.

So given that, we'd spin off a "task" for each alert instance, each of which would serially run the 3 actions. Presumably we won't have alerts with hundreds of actions associated with them, but we will have alerts that have thousands of alert instances, so that's something to worry about - maybe we'll need to do additional "chunking" beyond "task per alert instance", or only allow 50 of these to run at the same time, or something (smells like task manager).

The issue with running so many tasks is unlikely to be high CPU or memory usage from running so many simultaneously, but that we'll be opening up so many simultaneous network connections (potentially killing the service implementing the action, or getting rate limited). Actions typically don't do a whole lot beyond security checks and then an HTTP request.

I'm hesitant on retries - I feel like spurious 50x errors are pretty common for service providers, and trying again 500ms later often works fine :-). Maybe we could at least "requeue" these to try after running all the others, with a small delay if there wasn't already one.
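
For illustration, a sketch of that "requeue once after the others, with a small delay" idea; the single-retry policy and the 500ms delay are illustrative, not agreed behaviour:

```ts
type ExecuteAction = () => Promise<void>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// First pass: run everything, deferring failures. Second pass: retry the
// failures once after a short delay, logging anything that still fails.
async function runWithOneDeferredRetry(actions: ExecuteAction[], retryDelayMs = 500) {
  const failed: ExecuteAction[] = [];
  for (const action of actions) {
    try {
      await action();
    } catch {
      failed.push(action);
    }
  }
  if (failed.length > 0) {
    await sleep(retryDelayMs);
    for (const action of failed) {
      await action().catch((err) => console.warn('action failed after retry', err));
    }
  }
}
```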

One thing we don't need to do immediately, but would like to do soon-ish, is somehow feed the status of the actions into the alert execution status. It could be "all the actions ran fine" or could be ["this action failed with X", "that action failed with Y"], probably with some limit, so if there were 1000 errors the final entry would be "and 998 other errors", kinda thing. This will finally get us to being able to display SOME error status on action executions (from alerts anyway), which we don't have today. Maybe start with an actionsErrorCount in the status, and you'd have to click something in the UI to see what the errors were, and we'd query the event log for those.
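
For illustration, a sketch of folding action failures into an execution status with a capped message list; the actionsErrorCount field name comes from the comment above, everything else is an assumption:

```ts
interface ActionOutcome {
  actionId: string;
  error?: string; // present only when the action failed
}

// Summarize failures, keeping at most `maxMessages` individual messages and
// collapsing the rest into an "and N other errors" entry.
function summarizeActionErrors(outcomes: ActionOutcome[], maxMessages = 2) {
  const errors = outcomes.filter((o) => o.error !== undefined);
  const messages = errors.slice(0, maxMessages).map((o) => `${o.actionId}: ${o.error}`);
  if (errors.length > maxMessages) {
    messages.push(`and ${errors.length - maxMessages} other errors`);
  }
  return { actionsErrorCount: errors.length, actionsErrorMessages: messages };
}

// e.g. 1000 failing outcomes -> { actionsErrorCount: 1000,
//      actionsErrorMessages: ['a: X', 'b: Y', 'and 998 other errors'] }
```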

Oh, we'll be able to tie the action invocations to the alert invocations as well, in the event log, which we aren't doing today!

I have a feeling we may get to a point where we might say: this is too many to run right now, so let's batch them into one task manager task and queue it. Not sure how messy that would get, and it puts us back in the "failed task manager docs and action task parameter saved objects" problem, but presumably with fewer documents overall.

@chrisronline chrisronline moved this from In Progress to In Review in Kibana Alerting May 25, 2021
@gmmorris gmmorris changed the title The execution of actions should happen in the same task as the alert execution The execution of actions should happen as soon as possible after rule execution Jun 2, 2021
@gmmorris gmmorris changed the title The execution of actions should happen as soon as possible after rule execution The execution of actions is often delayed after a rule schedules them due to capacity and persistence overhead Jun 2, 2021
@gmmorris gmmorris added the Feature:Alerting/RulesFramework and Feature:Alerting/RuleActions labels Jul 1, 2021
Kibana Alerting automation moved this from In Review to Done (Ordered by most recent) Jul 20, 2021
@kobelb kobelb added the needs-team label Jan 31, 2022
@botelastic botelastic bot removed the needs-team label Jan 31, 2022