The execution of actions is often delayed after a rule schedules them due to capacity and persistence overhead #90888
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

I think this issue would be a good one to work on in ~7.14.
This will provide the opportunity to more easily "pipeline" actions - run a Jira action to create an issue, where the issue number/link is returned in the action result, which can then - somehow - be passed to subsequent actions, like an email/Slack action that would include a link to the Jira issue. Still lots of design and code to get there, but running actions in the same task as the alert makes this a lot easier than the current architecture.

It's also interesting because today actions only have access to the "context variables" the alert provided, which are serialized JSON. If we're running in the same flow as alerts, there's a possibility we could make more interesting alert data available - presumably via function access - as a future enhancement. Likewise, on the "output" end of actions, we should be able to capture more error diagnostic information that the alert could know about. Today that's difficult because the alert just queues the actions to run and never gets their results.

There is likely a lot of optimization we can make when we have to run the same actions for multiple instances - security checks and what-not. Probably as a follow-on PR to just calling the actions via the action client today.
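The pipelining idea above could be sketched roughly as below. All the names here (`Action`, `ActionResult`, `runPipeline`) are hypothetical, not actual Kibana APIs, and merging each result into a shared context is just one guess at the undecided "somehow pass that data" step:

```typescript
// Hypothetical sketch of serial action pipelining. An action's result data
// (e.g. the issue URL from a Jira action) is merged into the context seen
// by subsequent actions (e.g. an email/Slack action linking to the issue).
interface ActionResult {
  status: 'ok' | 'error';
  data?: Record<string, unknown>; // e.g. { issueUrl: '...' } from a Jira action
}

type Action = (context: Record<string, unknown>) => Promise<ActionResult>;

async function runPipeline(
  actions: Action[],
  initialContext: Record<string, unknown>
): Promise<ActionResult[]> {
  const results: ActionResult[] = [];
  let context = { ...initialContext };
  for (const action of actions) {
    // Serial ordering: the 2nd action is not invoked until the 1st is done.
    const result = await action(context);
    results.push(result);
    if (result.status === 'ok' && result.data) {
      context = { ...context, ...result.data };
    }
  }
  return results;
}
```

This only works if actions run in a known order within one execution, which is exactly why running them in the same task as the rule (rather than as independent, concurrently queued tasks) makes pipelining easier.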
One thing to consider in this regard is that the next run of the alert won't happen until all actions have executed - I'm not sure that's the behavior we want. Perhaps the approach should be to group all actions into a single task (rather than a task per action, as we have now), but keep that task separate from the alert task?
Can we elaborate on the various scenarios we should ensure still work properly with any change to this behavior? For example, if we decide to handle action execution within the same task, it might mean we lose out on multi-Kibana action execution (as in, the single Kibana that picked up the task must execute all the actions) - is that something we can't change? Also, AFAIK, we do not have any retry logic in place for the action task if it fails, and I'm assuming we don't need to change this behavior, but do we need to leave the door open for retries in the future? Or is that not something we've committed to doing?
There's probably a bunch, but here's what crosses my mind at this moment in time:
Sigh, this is tricky.
Agreed, we don't need to introduce new retry logic, but we might want to in the future. One general note:
Bunch of rando thoughts below, but I feel like there's a bunch of unknown unknowns, so I think hacking at it to get it to basically work is the way to start, and then see what happens.

Our stress tests invoke actions, right? I think maybe just server log though? Will be good enough to start with, to make sure even a basic implementation isn't breaking something.

... rando thoughts ...

Yes, let's not spin off tasks with
My memory is that we spin off a task for each action - so if you have 10 alert instances that need to call 3 actions, that's 30 tasks, and we basically just queue them all to run independently, at the same time.

Longer-term, it would be nice to serialize those 3 actions, so the 2nd isn't invoked till the 1st is done, and the 3rd isn't invoked till the 2nd is done, because we'll want to be able to feed the output from previous actions as input to subsequent actions. That's all we know about that - it's not clear HOW we pass that data! But it does imply there's some ordering, and it would be nice to see if we could make that possible now. So given that, we'd spin off a "task" for each alert instance, each of which would serially run the 3 actions.

Presumably we won't have alerts with 100's of actions associated with them, but we will have alerts that have 1000's of alert instances, so that's something to worry about - maybe we'll need to do additional "chunking" beyond "task per alert instance", or only allow 50 of these to run at the same time, or something (smells like task manager). The issue with running so many tasks is unlikely to be high CPU or memory usage from running so many simultaneously, but that we'll be opening up so many simultaneous network connections (potentially killing the service implementing the action, or getting rate limited). Actions typically don't do a whole lot beyond security checks and then an HTTP request.

I'm hesitant on retries - I feel like spurious 50x errors are pretty common for service providers, and trying again 500ms later often works fine :-). Maybe we could at least "requeue" these to try after running all the others, with a small delay if there wasn't already one.

One thing we don't need to do immediately, but would like to do soon-ish, is somehow feed the status of the actions into the alert execution status. It could be "all the actions ran fine" or could be ["this action failed with X", "that action failed with Y"], probably with some limit, so if there were 1000 errors, the final entry would be "and 998 other errors", kinda thing. This will finally get us to being able to display SOME error status on action executions (from alerts, anyway), which we don't have today. Maybe start with an

Oh, we'll be able to tie the action invocations to the alert invocations as well, in the event log, which we aren't doing today!

I have a feeling we may get to a point where we might say - this is too many to run right now, so let's batch them into one task manager task and queue it. Not sure how messy that would get, and it puts us back in the "failed task manager docs and action task parameter saved objects" problem, but presumably fewer documents overall.
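The capped error list could look something like this minimal sketch; `summarizeActionErrors` is a hypothetical helper, not existing Kibana code:

```typescript
// Hypothetical sketch: report at most `limit` entries in the execution
// status, collapsing the tail into a single "and N other errors" entry.
function summarizeActionErrors(errors: string[], limit = 3): string[] {
  if (errors.length <= limit) return errors;
  // Keep the first (limit - 1) errors, then one summary entry for the rest.
  const shown = errors.slice(0, limit - 1);
  return [...shown, `and ${errors.length - (limit - 1)} other errors`];
}
```

With `limit = 3` and 1000 errors, this yields two real errors followed by "and 998 other errors", matching the example above.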
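The "only allow 50 of these to run at the same time" idea is essentially a concurrency-limited worker pool over the per-instance action chains. A minimal sketch, assuming a generic in-process helper rather than the real task manager (`runWithConcurrencyLimit` is an illustrative name):

```typescript
// Hypothetical sketch: run at most `limit` jobs (e.g. per-alert-instance
// action chains) concurrently, so 1000s of instances don't open 1000s of
// simultaneous network connections to the action's service.
async function runWithConcurrencyLimit<T>(
  jobs: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(jobs.length);
  let next = 0;
  async function worker(): Promise<void> {
    // Each worker pulls the next unclaimed job off the shared queue.
    while (next < jobs.length) {
      const index = next++;
      results[index] = await jobs[index]();
    }
  }
  // Start `limit` workers; all jobs complete when every worker drains out.
  await Promise.all(
    Array.from({ length: Math.min(limit, jobs.length) }, () => worker())
  );
  return results;
}
```

The cap bounds simultaneous connections regardless of how many alert instances fire, which addresses the rate-limiting concern without a task per action.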
There is a lot of overhead running alert actions, each in their own task. It can also clog up the task manager queue when many alert actions need to be scheduled or be hard to serialize when the payload is large. We should look at a way to eliminate the usage of tasks by the actions plugin and run in the same execution as the alert. This will allow a lot of code cleanup (action_task_params, task runner, etc.) and help with the overall performance.
Since the retry logic doesn't work at this time (#79169), we could consider an "at most once" model for now and worry about retry logic as the requirements arise.
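An "at most once" inline model could be sketched as below. `executeActionsInline` and `ExecutionStatus` are illustrative names only, assuming failures are recorded into the rule's execution status but never requeued or retried:

```typescript
// Hypothetical sketch: run actions inline within the rule's own execution
// instead of queuing a task per action. "At most once": a failure is
// captured in the execution status, not retried.
interface ExecutionStatus {
  succeeded: number;
  failed: number;
  errors: string[];
}

async function executeActionsInline(
  actions: Array<() => Promise<void>>
): Promise<ExecutionStatus> {
  const status: ExecutionStatus = { succeeded: 0, failed: 0, errors: [] };
  for (const action of actions) {
    try {
      await action();
      status.succeeded++;
    } catch (err) {
      // Record the failure so it can surface in the rule's status; no retry.
      status.failed++;
      status.errors.push(err instanceof Error ? err.message : String(err));
    }
  }
  return status;
}
```

Because the rule task awaits its own actions, their outcomes are available to fold into the execution status and event log, which the current fire-and-forget task-per-action model can't do.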