[Alerting] Change execution of alerts from async to sync #97311
Conversation
jenkins, test this (necessary for Jenkins update)
I think it will affect the behavior. Currently, if an alert causes 1000 actions to be run, they will be queued to TM and then run at the concurrency TM is configured for. I think we need to have some kind of "batching" for these things to run. My first thought is to create a new "queue" for running these actions, per Kibana, then use something like rx's bufferCount to handle each "buffer" of them. We could probably use TM's …
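To make the `bufferCount` idea concrete, here's a minimal hypothetical sketch — the `ActionToRun` shape, `executeAction` helper, and `BATCH_SIZE` constant are all assumptions, not anything from this PR. Note that `bufferCount` alone only flushes full buffers, so a real queue would likely also need a time-based flush.

```ts
import { Subject } from 'rxjs';
import { bufferCount, concatMap } from 'rxjs/operators';

interface ActionToRun {
  actionId: string;
  params: Record<string, unknown>;
}

// stand-in for whatever actually fires a single action
declare function executeAction(action: ActionToRun): Promise<void>;

// batch size would presumably be derived from TM's configured concurrency
const BATCH_SIZE = 10;

const actionQueue$ = new Subject<ActionToRun>();

actionQueue$
  .pipe(
    // collect incoming actions into fixed-size buffers...
    bufferCount(BATCH_SIZE),
    // ...and run one buffer at a time; actions within a buffer run concurrently
    concatMap((batch) => Promise.all(batch.map((action) => executeAction(action))))
  )
  .subscribe();

// alert executions would push their actions onto the queue:
// actionQueue$.next({ actionId: 'abc', params: { message: 'hi' } });
```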
Tons! :-) Currently actions are run in specialized TM tasks, so we can "see" them running, both as individual tasks, and get some aggregations on them - not that we have great insight into these today, but they do exist, and I've used them in the past. I'm guessing they'd be most noticeable now as missing in the TM health stats, where they used to show up. I think at a minimum, we should figure out how we can "count" the number of "still running" actions, and return them in a health endpoint or similar.

Another area where we haven't provided great insight is the status of action executions in general. I don't think we provide any insight into this at all today, but it does give us an opportunity to add some new "indicators" in the alert execution status, as to the status of the action executions. We don't need to do that in this PR, but we should probably open a new issue to discuss how we might want to do this.

One complication: if we just add some indication of action execution errors to the alert execution status, we'd have to wait for the actions to run before updating the alert execution status, which ... I don't think we should wait for. Maybe it would be a separate status-y thing that we'd update after all the actions run.
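The "count the still-running actions" idea could be as small as an in-memory gauge wrapped around every execution; this sketch is illustrative only, and the `InFlightActionTracker` name is invented, not part of the PR.

```ts
// Minimal sketch of a "still running" counter for a health endpoint.
export class InFlightActionTracker {
  private inFlight = 0;

  // wrap every action execution so the gauge stays accurate on success or failure
  async track<T>(run: () => Promise<T>): Promise<T> {
    this.inFlight++;
    try {
      return await run();
    } finally {
      this.inFlight--;
    }
  }

  // a health endpoint handler could return this value
  getInFlightCount(): number {
    return this.inFlight;
  }
}

// usage: await tracker.track(() => executeAction(action));
```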
Noted! I'll look into adding this logic to the PR.
I can't remember if we discussed using the existing ephemeral task running stuff for this case. @mikecote @gmmorris ? I think we currently use that for "runNow" kinda stuff? Do we want to mix these? Off the top of my head, I think the "downside" is that action executions will run with more latency since they will be mixed with existing tasks. That's of course "good" as well, in some sense, but it means the alert executions are going to take longer to finish, compared to if we don't reuse that queue. If we go this route, we will certainly need to take into account when the alert itself completes, in terms of getting it rescheduled. We wouldn't want delayed action executions to delay the next run of the alert itself. We might already be doing that here ...
@gmmorris With 9bb7ed2 added, it's ready for you to take a look at and lemme know your thoughts (cc @mikecote)
I'm not exactly following this. Can you clarify what you mean? AFAIK, the actions should execute in the same task that picked up the alert, so while we don't gain parallel action execution, we still gain some concurrency by using `Promise.all`.
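A rough sketch of the concurrency being described — the commit history mentions `Promise.all`; `Promise.allSettled` is used here so one rejection doesn't swallow the rest, and all names and error handling are illustrative rather than the PR's actual code.

```ts
interface ActionToRun {
  actionId: string;
  params: Record<string, unknown>;
}

declare function executeAction(action: ActionToRun): Promise<void>;

// Runs all of an alert's actions inside the task that executed the alert.
// There's no parallelism across Kibana nodes, but the connector calls
// (HTTP requests, etc.) are all in flight concurrently within this task.
async function runActionsInAlertTask(actions: ActionToRun[]): Promise<void> {
  const results = await Promise.allSettled(actions.map((a) => executeAction(a)));
  for (const result of results) {
    if (result.status === 'rejected') {
      // one failed action shouldn't fail the whole alert task
      console.warn('action execution failed:', result.reason);
    }
  }
}
```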
After syncing with @gmmorris offline, I'm going to leave the monitoring-related code as-is (knowing it's not a workable solution) and look at adding new E2E tests that can verify these changes will hold up at some scale.
My worry was that if "alert execution" will now mean "alert execution + the time to execute all the actions (hopefully with some parallelization)", how does that impact other parts of alerting?

I was wondering about alerts rescheduling their next run, but I think if we are explicitly doing that (versus letting TM manage it, can't remember), we could reschedule before or during the running of the actions. If the rescheduling of the next run of an alert is somehow based on the time it takes to run the task, then I'm worried we may have cases where alerts won't run as often as they should. Or we could skip an alert execution if there are tons of actions to run and the time to run them exceeds the alert interval.

I guess our task manager stats will be somewhat affected as well. An alert execution that runs a bunch of actions will take longer than one that doesn't (presumably), so those tasks will appear to be "slower". We can still capture alert and action execution times separately in the event log though. We'll have to keep that in mind when scanning through the task manager health stats, a new favorite hobby of mine :-)
In our sync we covered a couple of things that I'll reiterate here:
Yeah, I'm on the same page as Patrick here - the idea of using TM with ephemeral tasks was that we can complete the Rule task without waiting for the actions, and those actions will still respect the capacity of the Kibana in question, as they will take up workers. Rescheduling the Rule should happen the moment the rule completes, not when its actions complete.
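The sequencing Gidi describes might look something like the sketch below: the rule task returns (and is rescheduled) as soon as rule execution finishes, while actions are handed to the ephemeral queue without being awaited. All names here are illustrative assumptions.

```ts
interface ActionToRun {
  actionId: string;
  params: Record<string, unknown>;
}

interface RuleTaskResult {
  state: Record<string, unknown>;
  runAt: Date; // when the rule task should run next
}

declare function executeRule(): Promise<{ actionsToFire: ActionToRun[] }>;
// fire-and-forget handoff to the ephemeral queue
declare function enqueueEphemeral(action: ActionToRun): void;

async function runRuleTask(intervalMs: number): Promise<RuleTaskResult> {
  const { actionsToFire } = await executeRule();

  // hand actions off without awaiting them, so slow or numerous actions
  // cannot delay the rule's next scheduled run
  for (const action of actionsToFire) {
    enqueueEphemeral(action);
  }

  // reschedule relative to rule completion, not action completion
  return { state: {}, runAt: new Date(Date.now() + intervalMs) };
}
```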
Sweet, looks good! Made a number of comments, nits, WIBNIs, two of which I think are things we need to do:
- ensure we are tracking task execution in APM (could defer to another issue if too big)
- set a config-schema max value for the `ephemeral_tasks.request_capacity` config (hello, unbounded queue!) — see the sketch after this list
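One way the requested bound could look with `@kbn/config-schema` — the defaults and the max of 100 here are placeholders, not the values the PR actually shipped:

```ts
import { schema } from '@kbn/config-schema';

export const taskManagerConfig = schema.object({
  ephemeral_tasks: schema.object({
    enabled: schema.boolean({ defaultValue: false }),
    // `max` bounds the request queue so it can no longer grow without limit
    request_capacity: schema.number({
      defaultValue: 10,
      min: 1,
      max: 100,
    }),
  }),
});
```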
I was playing with this yesterday, saw some obvious changes when enabling/disabling it - good changes!
```ts
it('should execute all requests, when some will be ephemeral and some not', async () => {
  const nonEphemeralTasks = 3;
  const actionPromises = [];
  for (let i = 0; i < DEFAULT_MAX_EPHEMERAL_ACTIONS_PER_ALERT + nonEphemeralTasks; i++) {
    // … (diff excerpt continues)
```
I think I would have done this as a single action, and then arrange to have the alert create as many instances as you want actions created. But it's kinda nice to see a multi-action story; I don't think we use that much in the FTR.
) | ||
); | ||
|
||
```ts
const executeActionsEvents = getEventsByAction(events, 'execute');
```
Too bad we can't tell if the tasks were executed ephemerally or not. Either via a top-level mustache variable, or via some property in the event log. Neither of those make much sense to me right now, but I'd bet we'll find a need/desire for something like that, at some point, so maybe we'll have a reason that makes sense later.
Is there some way we can make use of task manager health here? The problem is that it's going to be polluted with results from previous runs. Or maybe a server-side endpoint that returns usage data - we call it before and after, and make sure some tasks were run ephemerally.
I don't think this is a blocker, and we shouldn't spend a lot of time on it; opening an issue to enhance this later WORKSFORME.
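The PR's commits do mention tracking tasks "by persistence" in TM health stats, so a field along these lines in the event log would let an FTR test assert that a given action actually ran ephemerally. The exact shape below is invented for illustration:

```ts
type TaskPersistence = 'recurring' | 'non_recurring' | 'ephemeral';

// hypothetical event-log document shape for an action execution
interface ActionExecuteEvent {
  event: {
    action: 'execute';
    outcome: 'success' | 'failure';
  };
  kibana: {
    task?: {
      persistence: TaskPersistence;
    };
  };
}

// a test could then filter:
// events.filter((e: ActionExecuteEvent) => e.kibana.task?.persistence === 'ephemeral');
```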
It's a good start :-). It should probably mention that the downside of not persisting the actions is that there is a potential for those actions to not run, if Kibana dies. Longish, but a little more detail:
I'm happy with something pretty short here, plus an expanded section in the Troubleshooting part of the doc - which we don't necessarily need to add in this PR (but let's open an issue to add it, if we don't).
I mentioned in my review that it would be nice to somehow be able to tell ephemeral tasks from queued tasks in some contexts, and just thought of one:
Just noting that for follow-on work, since I have a feeling there may be other accounting things we'll want for these, after we've seen it running in practice ...
@pmuellr #97311 (comment) is perfect, I love it! Thanks!
LGTM! Great work :)
It looks like the docker allow-listing still needs to be added to this PR, and I'm still curious about this comment from my last review (although not a blocker).
💚 Build Succeeded
* added ability to run ephemeral tasks
* fixed typing
* added typing on plugin
* WIP
* Fix type issues
* Hook up the ephemeral task into the task runner for actions
* Tasks can now run independently of one another
* Use deferred language
* Refactor taskParams slightly
* Use Promise.all
* Remove deferred logic
* Add config options to limit the amount of tasks executing at once
* Add ephemeral task monitoring
* WIP
* Add single test so far
* Ensure we log after actions have executed
* Remove confusing * 1
* Add logic to ensure we fallback to default enqueueing if the total actions is above the config
* Add additional test
* Fix tests a bit, ensure we log the alerting:actions-execute right away and the tests should listen for alerts:execute
* Better tests
* If the queue is at capacity, attempt to execute the ephemeral task as a regular action
* Ensure we run ephemeral tasks before to avoid them getting stuck in the queue
* Do not handle the promise anymore
* Remove unnecessary code
* Properly handle errors from ephemeral task lifecycle
* moved actions domain out of alerting and into actions plugin
* Remove some tests
* Fix TS and test issues
* Fix type issues
* Fix more type issues
* Fix more type issues
* Fix jest tests
* Fix more jest tests
* Off by default
* Fix jest tests
* Update config for this suite too
* Start of telemetry code
* Fix types and add missing files
* Fix telemetry schema
* Fix types
* Fix more types
* moved load event emission to pollingcycle and added health stats on Ephemeral tasks
* Add more telemetry data based on new health metrics for the ephemeral queue
* Fix tests and types
* Add separate request capacity for ephemeral queue
* Fix telemetry schema and add tests for usage collection
* track polled tasks by persistence and use in capacity estimation instead of executions
* fixed typing
* Bump default capacity
* added delay metric to ephemeral stats
* Fix bad merge
* Fix tests
* Fix tests
* Fix types
* Skip failing tests
* Exclude ephemeral stats from capacity estimation tests
* PR feedback
* More PR feedback
* PR feedback
* Fix merge conflict
* Try fixing CI
* Fix broken lock file from merge
* Match master
* Add this back
* PR feedback
* Change to queue and add test
* Disable ephemeral queue in tests
* Updated desc
* Comment out ephemeral-specific tests that require the entire test suite to support ephemeral tasks
* Add clarifying comment

Co-authored-by: Gidi Meir Morris <github@gidi.io>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Relates to #90888
Continuation of #96719
Cloud PR: https://github.com/elastic/cloud/pull/84878
Summary
Currently, action execution occurs as any other task within task manager - a saved object is persisted in Elasticsearch and any running Kibana (connected to that Elasticsearch) is able to claim the task and execute the single action. It's a bit more involved, as we actually persist two separate saved objects - one that contains the action itself and one that contains the action params.
This is less than ideal, as it introduces a fair amount of unnecessary overhead: it can clog up the task manager queue, and it stores saved objects that need to be cleaned up later (introducing the risk of orphaned saved objects).
We'd like to move away from this approach and execute actions within the same task that processed the alert. This introduces extra challenges, mainly around ensuring a single Kibana can properly handle the execution of many actions per alert.
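An approximate sketch of the current (pre-change) flow described above: the execution params are persisted as their own saved object, and a TM task referencing it is scheduled. The signatures here are simplified stand-ins for Kibana's actual clients, not its real API.

```ts
declare const savedObjectsClient: {
  create(type: string, attributes: object): Promise<{ id: string }>;
};
declare const taskManager: {
  schedule(task: { taskType: string; params: object; state: object }): Promise<unknown>;
};

async function enqueueExecution(actionTypeId: string, actionId: string, params: object) {
  // saved object #1 (the action/connector itself) already exists; this creates
  // saved object #2, holding the per-execution params
  const actionTaskParams = await savedObjectsClient.create('action_task_params', {
    actionId,
    params,
  });

  // any Kibana connected to the same Elasticsearch can claim and run this task
  await taskManager.schedule({
    taskType: `actions:${actionTypeId}`,
    params: { actionTaskParamsId: actionTaskParams.id },
    state: {},
  });
}
```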
How it works now
Suggested change
For the purposes of this discussion, the term "ephemeral task" is used to describe an action that is executed within the same task as the alert, instead of enqueuing itself within Task Manager.
- When executing the actions for an alert, we run up to a certain number of them (dictated by `xpack.alerting.maxEphemeralActionsPerAlert`) with `actionsClient.ephemeralEnqueuedExecution`. Any additional actions are enqueued through the current method (`actionsClient.enqueueExecution`).
- When calling `actionsClient.ephemeralEnqueuedExecution`, we create a set of params (similarly to how actions currently work) that are passed into the task manager. These params aren't persisted in a saved object, but rather just maintained in memory.
- The ephemeral queue hooks into task manager's `TASK_POLLING_CYCLE` event. When the event is fired and there are actions in the queue and the task pool has capacity, we execute as many actions as possible (subject to current capacity).
- If the queue is at capacity, an error is thrown; the place where `actionsClient.ephemeralEnqueuedExecution` is called catches this error and will then enqueue the action through the current method (`actionsClient.enqueueExecution`) — see the sketch after this list.
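A sketch of that fallback path: try the in-memory ephemeral queue first, and fall back to a persisted task when it's full. The error class and option shape below are illustrative, not the PR's exact API.

```ts
class EphemeralQueueAtCapacityError extends Error {}

declare const actionsClient: {
  ephemeralEnqueuedExecution(options: object): Promise<void>;
  enqueueExecution(options: object): Promise<void>;
};

async function scheduleActionExecution(options: object): Promise<void> {
  try {
    await actionsClient.ephemeralEnqueuedExecution(options);
  } catch (err) {
    if (err instanceof EphemeralQueueAtCapacityError) {
      // queue at capacity: persist the params and let task manager run it
      await actionsClient.enqueueExecution(options);
    } else {
      throw err;
    }
  }
}
```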
New API
Release and Experimentation Plan
Following a @dev-wide RFC, we have made several decisions about the release of this feature:
cc @elastic/kibana-alerting-services