
[Alerting] Change execution of alerts from async to sync #97311

Merged

Conversation

chrisronline (Contributor) commented Apr 15, 2021

Relates to #90888
Continuation of #96719

Cloud PR: https://github.com/elastic/cloud/pull/84878

Summary

Currently, action execution occurs like any other task within task manager: a saved object is persisted in Elasticsearch, and any running Kibana (connected to that Elasticsearch) is able to claim the task and execute the single action. It's a bit more involved than that, as we actually persist two separate saved objects - one that contains the action itself and one that contains the action params.

This is less than ideal, as it introduces a fair amount of unnecessary overhead: it can clog up the task manager queue and it stores saved objects that later need to be cleaned up, introducing the risk of orphaned saved objects.

We'd like to move away from this approach and execute actions within the same task that processed the alert. This introduces extra challenges, mainly around ensuring a single Kibana can properly handle the execution of many actions per alert.

How it works now

  1. Actions are triggered
  2. We create a saved object to store the task params
  3. We schedule the action to run through task manager
  4. Any available Kibana picks up the task and fetches the task params
  5. The action is executed

Suggested change

For the purposes of this discussion, the term "ephemeral task" is used to describe an action that is executed within the same task as the alert, instead of being enqueued as a separate task within Task Manager.

  1. Actions are triggered
  2. We iterate through all actions, and attempt to execute each action ephemerally (up to xpack.alerting.maxEphemeralActionsPerAlert) with actionsClient.ephemeralEnqueuedExecution. Any additional actions are enqueued through the current method (actionsClient.enqueueExecution)
  3. From actionsClient.ephemeralEnqueuedExecution, we create a set of params (similarly to how actions currently work) that are passed into the task manager. These params aren't persisted in a saved object, but rather just maintained in memory.
  4. The task manager passes the task along to a custom task lifecycle class. This class pushes all enqueued actions into a queue and also subscribes to the TASK_POLLING_CYCLE event. When the event is fired and there are actions in the queue and the task pool has capacity, we execute as many actions as possible (subject to current capacity)
  5. If we are unable to add an ephemeral action to the queue (which exists within the custom task lifecycle class), we return an error. The caller of actionsClient.ephemeralEnqueuedExecution catches this error and falls back to enqueuing the action through the current method (actionsClient.enqueueExecution)

New API

actionsClient.executeEphemeralTasks([{
  taskType: `actions:${action.actionTypeId}`,
  params: {
    ...action.params,
    taskParams: {
      actionId: action.id,
      apiKey,
      params: action.params,
    },
  },
  state: {},
}]);
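To make steps 2 and 5 of the suggested change concrete, the attempt-then-fallback flow could look roughly like the sketch below. This is illustrative only: the ActionsClientLike interface and ExecuteOptions shape are simplified stand-ins, not the actual Kibana types.

interface ExecuteOptions {
  id: string;
  params: Record<string, unknown>;
  apiKey: string | null;
}

interface ActionsClientLike {
  ephemeralEnqueuedExecution(options: ExecuteOptions): Promise<void>;
  enqueueExecution(options: ExecuteOptions): Promise<void>;
}

async function scheduleActions(
  actionsClient: ActionsClientLike,
  actions: ExecuteOptions[],
  maxEphemeralActionsPerAlert: number // mirrors xpack.alerting.maxEphemeralActionsPerAlert
): Promise<void> {
  await Promise.all(
    actions.map(async (action, index) => {
      // Anything beyond the configured limit is enqueued through Task Manager as before.
      if (index >= maxEphemeralActionsPerAlert) {
        return actionsClient.enqueueExecution(action);
      }
      try {
        // Attempt to run in memory, within the same task that processed the alert.
        await actionsClient.ephemeralEnqueuedExecution(action);
      } catch (err) {
        // If the ephemeral queue is at capacity, fall back to the persisted path.
        await actionsClient.enqueueExecution(action);
      }
    })
  );
}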

Release and Experimentation Plan

Following a @dev-wide RFC, we have made several decisions about the release of this feature:

  1. This feature will be disabled by default when released. This feature will be marked as experimental and switching it on voids support.
  2. We are adding telemetry around this feature to assess its impact when switched on
  3. Once released, we will evaluate the stability and impact of the feature by performing stress tests on Cloud in a structured experiment. [Task manager] Evaluate the impact of Ephemeral Tasks #102616

cc @elastic/kibana-alerting-services

spalger (Contributor) commented Apr 22, 2021

jenkins, test this (necessary for Jenkins update)

pmuellr (Member) commented Apr 26, 2021

The issue with running so many tasks is unlikely to be high CPU or memory usage from running them simultaneously, but rather that we'll be opening up so many simultaneous network connections (potentially killing the service implementing the action, or getting rate limited). Actions typically don't do a whole lot beyond security checks and then an HTTP request.

Good point here. I doubt this change will affect this behavior much, but it's something we can tackle better in the future, especially considering at least some of the service APIs support "bulk" requests.

I think it will affect the behavior. Currently, if an alert causes 1000 actions to be run, they will be queued to TM and then run at the concurrency set by TM's max_workers - by default, 10 at a time. IIRC, the new scheme would mean we do 1000 executions inline, collecting the promises of their executions, and waiting for them to complete. So, we'd run 1000 at a time.

I think we need to have some kind of "batching" for these things to run. My first thought is to create a new "queue" for running these actions, for every Kibana. Then use something like rx's bufferCount to handle each "buffer" of them. We could probably use TM's max_workers for the "count", and 5 seconds or so for the time limit. Then we serialize each "batch", so only one batch runs at a time, and do the Promise.all() on those.
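For illustration, the buffering idea could be sketched with RxJS roughly as follows; runAction, maxWorkers, and the queue wiring here are placeholders rather than actual Task Manager code.

import { Subject, from } from 'rxjs';
import { bufferTime, filter, concatMap } from 'rxjs/operators';

interface QueuedAction {
  actionId: string;
  params: Record<string, unknown>;
}

// Placeholder for the real action executor.
async function runAction(action: QueuedAction): Promise<void> {
  console.log(`executing action ${action.actionId}`);
}

const maxWorkers = 10; // would come from Task Manager's max_workers setting
const bufferTimeLimitMs = 5000;

const actionQueue$ = new Subject<QueuedAction>();

actionQueue$
  .pipe(
    // Emit a batch when either `maxWorkers` actions accumulate or 5 seconds pass.
    bufferTime(bufferTimeLimitMs, null, maxWorkers),
    filter((batch) => batch.length > 0),
    // concatMap serializes the batches: the next batch starts only after the
    // previous Promise.all resolves, so only one batch runs at a time.
    concatMap((batch) => from(Promise.all(batch.map((action) => runAction(action)))))
  )
  .subscribe();

// The alert task would push each triggered action onto the queue:
actionQueue$.next({ actionId: 'example-action-id', params: { message: 'hello' } });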

pmuellr (Member) commented Apr 26, 2021

Do we lose any insight into running tasks?

Tons! :-)

Currently actions are run in specialized TM tasks, so we can "see" them running, both as individual tasks, and get some aggregations on them - not that we have great insight to these today, but they do exist, and I've used them in the past. I'm guessing they'd be most noticeable now as missing in the TM health stats, where they used to show up there.

I think at a minimum, we should figure out how we can "count" the number of "still running" actions, and return them in a health endpoint or similar.
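One possible shape for such a counter, purely as a sketch (the class and its names are hypothetical, not the actual health-stats implementation):

class InFlightActionCounter {
  private inFlight = 0;

  // Wrap an action execution so the gauge stays accurate even when it throws.
  async track<T>(execute: () => Promise<T>): Promise<T> {
    this.inFlight++;
    try {
      return await execute();
    } finally {
      this.inFlight--;
    }
  }

  // Value to surface in a health/stats endpoint.
  get currentlyRunning(): number {
    return this.inFlight;
  }
}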

Another area that we haven't provided great insight to, is the status of action executions in general. I don't think we provide any insight to this at all today, but it does give us an opportunity to add some new "indicators" in the alert execution status, as to the status of the action executions. We don't need to do that in this PR, but we should probably open a new issue to discuss how we might want to do this. One complication, if we just add some indication of action execution errors to the alert execution status, is that we'd have to wait for the actions to run before updating the alert execution status, which ... I don't think we should wait for. Maybe it would be a separate status-y thing that we'd update after all the actions run.

chrisronline (Contributor, Author) commented Apr 27, 2021

I think we need to have some kind of "batching" for these things to run. My first thought is to create a new "queue" for running these actions, for every Kibana. Then use something like rx's bufferCount to handle each "buffer" of them. We could probably use TM's max_workers for the "count", and 5 seconds or so for the time limit. Then we serialize each "batch", so only one batch runs at a time, and do the Promise.all() on those.

Noted! I'll look into adding this logic to the PR

chrisronline (Contributor, Author) commented:

@pmuellr WDYT about this approach? 3800f5c

I didn't see a way to use bufferCount, as the actual queue itself is just a set we process on each task lifecycle TASK_POLLING_CYCLE event, so I ended up doing a simple .slice() and processing a configurable amount per "cycle".
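For illustration, the slice-per-cycle approach can be sketched like this; the queue shape, class name, and event wiring are simplified placeholders rather than the actual code in 3800f5c.

interface EphemeralTask {
  id: string;
  run: () => Promise<void>;
}

class EphemeralTaskLifecycleSketch {
  private queue: EphemeralTask[] = [];

  constructor(private readonly maxPerCycle: number, private readonly capacity: number) {}

  // Returns false when the queue is full so the caller can fall back to enqueueExecution.
  attemptToAdd(task: EphemeralTask): boolean {
    if (this.queue.length >= this.capacity) {
      return false;
    }
    this.queue.push(task);
    return true;
  }

  // Invoked on every TASK_POLLING_CYCLE event.
  async onPollingCycle(): Promise<void> {
    // Take up to maxPerCycle tasks off the front of the queue and run them together.
    const batch = this.queue.slice(0, this.maxPerCycle);
    this.queue = this.queue.slice(batch.length);
    await Promise.all(batch.map((task) => task.run()));
  }
}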

pmuellr (Member) commented Apr 27, 2021

@pmuellr WDYT about this approach? 3800f5c

I can't remember if we discussed using the existing ephemeral task running stuff for this case. @mikecote @gmmorris ? I think we currently use that for "runNow" kinda stuff? Do we want to mix these? Off the top of my head, I think the "downside" is that action executions will run with more latency since they will be mixed with existing tasks. That's off course "good" as well, in some sense, but means that the alert executions are going to take longer to finish, compared to if we don't reuse that queue.

If we go this route, we will certainly need to take into account when the alert itself completes, in terms of getting it rescheduled. We wouldn't want delayed action executions to delay the next run of the alert itself. We might already be doing that here ...

chrisronline (Contributor, Author) commented:

@gmmorris With 9bb7ed2 added, it's ready for you to take a look at and lemme know your thoughts (cc @mikecote)

@pmuellr

action executions will run with more latency since they will be mixed with existing tasks
alert executions are going to take longer to finish, compared to if we don't reuse that queue.

I'm not exactly following this. Can you clarify what you mean? AFAIK, the actions should execute in the same task that picked up the alert, so while we don't gain parallel action execution, we still gain some concurrency by using Promise.all, where one slow action will not prevent others from executing.

chrisronline (Contributor, Author) commented:

After syncing with @gmmorris offline, I'm going to leave the monitoring-related code as-is (knowing it's not a workable solution) and going to look at adding new E2E tests that can verify these changes will hold up to some scale.

pmuellr (Member) commented Apr 28, 2021

I'm not exactly following this. Can you clarify what you mean? AFAIK, the actions should execute in the same task that picked up the alert, so while we don't gain parallel action execution, we still gain some concurrency by using Promise.all, where one slow action will not prevent others from executing.

My worry was that if "alert execution" will now mean "alert execution + the time to execute all the actions (hopefully with some parallelization)", how does that impact other parts of alerting? I was wondering about alerts rescheduling their next run, but I think if we are explicitly doing that (versus letting TM manage it, can't remember), we could reschedule before or during the running of the actions. If the rescheduling of the next run of an alert is somehow based on the time it takes to run the task, then I'm worried we may have cases where alerts won't be running as often as they should be. Or we could skip an alert execution if there's tons of actions to run, and the time to run them exceeds the alert interval.

I guess our task manager stats will be somewhat affected as well. An alert execution that runs a bunch of actions will take longer than one that doesn't (presumably), so those tasks will appear to be "slower". We can still capture alert and action execution times separately in the event log though. We'll have to keep that in mind when scanning through the task manager health stats, a new favorite hobby of mine :-)

gmmorris (Contributor) commented:

After syncing with @gmmorris offline, I'm going to leave the monitoring-related code as-is (knowing it's not a workable solution) and going to look at adding new E2E tests that can verify these changes will hold up to some scale.

In our sync we covered a couple of things that I'll reiterate here:

  1. We don't want an ever-growing number in the metrics, as we'll obviously hit the limit of Number at some point, and it isn't actually that useful as a metric. Instead, we're going to take the approach of tracking a running average of the ephemeral task executions (a rough sketch follows this list):

    • How many ephemeral tasks were queued since the last polling cycle?
    • How many ephemeral tasks were ran since the last polling cycle?
    • What is the result frequency by task type of ephemeral tasks since the last polling cycle?
  2. The ephemeral task executions can probably live under the existing runtime stats, no need for a new root object.

  3. The processResult method in ephemeral_task_runner should be emitting TaskRun events for the ephemeral tasks, and they will be counted the same as the non-ephemeral tasks under the runtime stats. Let's double check that's true though. :)

  4. It sounds like we'd be best served by a hybrid approach where we use ephemeral tasks to run a batch of actions in the same Kibana that ran the rule, but if we have a huge number of actions, we probably still want to persist those. That cut off number should be configurable, and our end-to-end tests need to test all the permutations (all actions are run ephemerally, all are persisted [config is set to 0 maybe?], half are ephemeral and half are persisted).

  5. We might want to batch up the actions that we persist so that a single Kibana can pick up a bunch of actions at once... but I think we can defer that to a follow-up issue; it's just something worth mulling over.
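A rough sketch of the per-polling-cycle stats described in point 1 above, with illustrative names rather than the actual runtime-stats shape:

type TaskResult = 'success' | 'failure';

class EphemeralCycleStats {
  private queued = 0;
  private ran = 0;
  private resultsByType = new Map<string, Record<TaskResult, number>>();

  markQueued(): void {
    this.queued++;
  }

  markRan(taskType: string, result: TaskResult): void {
    this.ran++;
    const counts = this.resultsByType.get(taskType) ?? { success: 0, failure: 0 };
    counts[result]++;
    this.resultsByType.set(taskType, counts);
  }

  // Called once per polling cycle: report the window and reset the counters so
  // the numbers never grow without bound.
  flush() {
    const snapshot = {
      queuedSinceLastCycle: this.queued,
      ranSinceLastCycle: this.ran,
      resultFrequencyByTaskType: Object.fromEntries(this.resultsByType),
    };
    this.queued = 0;
    this.ran = 0;
    this.resultsByType.clear();
    return snapshot;
  }
}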

gmmorris (Contributor) commented Apr 29, 2021

I'm not exactly following this. Can you clarify what you mean? AFAIK, the actions should execute in the same task that picked up the alert, so while we don't gain parallel action execution, we still gain some concurrency by using Promise.all, where one slow action will not prevent others from executing.

My worry was that if "alert execution" will now mean "alert execution + the time to execute all the actions (hopefully with some parallelization)", how does that impact other parts of alerting? I was wondering about alerts rescheduling their next run, but I think if we are explicitly doing that (versus letting TM manage it, can't remember), we could reschedule before or during the running of the actions. If the rescheduling of the next run of an alert is somehow based on the time it takes to run the task, then I'm worried we may have cases where alerts won't be running as often as they should be. Or we could skip an alert execution if there's tons of actions to run, and the time to run them exceeds the alert interval.

I guess our task manager stats will be somewhat affected as well. An alert execution that runs a bunch of actions will take longer than one that doesn't (presumably), so those tasks will appear to be "slower". We can still capture alert and action execution times separately in the event log though. We'll have to keep that in mind when scanning through the task manager health stats, a new favorite hobby of mine :-)

Yeah, I'm on the same page as Patrick here - the idea of using TM with ephemeral tasks was that we can then complete the Rule task without waiting for the actions, and those actions will still respect the capacity of the Kibana in question as those actions will take up workers.

Rescheduling the Rule should happen the moment the rule completes, not when its actions complete.
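As a sketch of that ordering (function names here are placeholders, not the actual alerting task runner): the next run is scheduled as soon as the rule executor completes, and only then do we wait on the action executions.

async function runRuleTask(
  executeRule: () => Promise<Array<() => Promise<void>>>,
  scheduleNextRun: () => Promise<void>
): Promise<void> {
  // 1. Run the rule and collect the actions it triggered.
  const triggeredActions = await executeRule();

  // 2. Reschedule the rule immediately so slow actions never delay the next run.
  await scheduleNextRun();

  // 3. Only then wait for the (ephemeral) action executions to settle.
  await Promise.allSettled(triggeredActions.map((runAction) => runAction()));
}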

pmuellr (Member) left a comment:

Sweet, looks good! Made a number of comments, nits, WIBNIs, two of which I think are things we need to do:

  • ensure we are tracking task execution in APM (could defer to another issue if too big)
  • set a config-schema max value for ephemeral_tasks.request_capacity config (hello, unbounded queue!)
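As a sketch of the second item, assuming @kbn/config-schema; the key names, default, and maximum below are illustrative, not the actual values chosen in the PR.

import { schema } from '@kbn/config-schema';

const configSchema = schema.object({
  ephemeral_tasks: schema.object({
    enabled: schema.boolean({ defaultValue: false }),
    // Capping the value keeps the in-memory queue bounded.
    request_capacity: schema.number({ defaultValue: 10, min: 1, max: 100 }),
  }),
});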

I was playing with this yesterday, saw some obvious changes when enabling/disabling it - good changes!

Resolved review threads on x-pack/plugins/task_manager/server/task_running/errors.ts and docs/settings/alert-action-settings.asciidoc.
it('should execute all requests, when some will be ephemeral and some not', async () => {
  const nonEphemeralTasks = 3;
  const actionPromises = [];
  for (let i = 0; i < DEFAULT_MAX_EPHEMERAL_ACTIONS_PER_ALERT + nonEphemeralTasks; i++) {
Review comment (Member):

I think I would have done this as a single action, and then arrange to have the alert create as many instances as you want actions created. But it's kinda nice to see a multi-action story; I don't think we use that much in the FTR.


const executeActionsEvents = getEventsByAction(events, 'execute');
Review comment (Member):

Too bad we can't tell if the tasks were executed ephemerally or not. Either via a top-level mustache variable, or via some property in the event log. Neither of those make much sense to me right now, but I'd bet we'll find a need/desire for something like that, at some point, so maybe we'll have a reason that makes sense later.

Is there some way we can make use of task manager health here? The problem is that it's going to be polluted with results from the previous runs. Or maybe a server-side endpoint that returns usage data; we call it before and after, and make sure some tasks were run ephemerally.

I don't think this is a blocker, and we shouldn't spend a lot of time on it; opening an issue to enhance this later WORKSFORME.

Resolved review thread on yarn.lock.
@chrisronline chrisronline requested a review from a team as a code owner July 16, 2021 20:35
@chrisronline chrisronline removed the request for review from a team July 19, 2021 14:04
pmuellr (Member) commented Jul 19, 2021

What do we think is a good one liner to describe this work (used for the configs)? I did something really short with:

Enables an experimental feature that executes actions without incurring latency from persisting as traditional tasks in Task Manager

I don't like it but I don't know how to summarize this well.

It's a good start :-). It should probably mention that the downside of not persisting the actions is that there is a potential for those actions to not run, if Kibana dies.

Longish, but a little more detail:

  • Enables an experimental feature that executes a limited number of actions in the same task as the alert which triggered them. These action tasks will reduce the latency of the time it takes an action to run after it's triggered, but are not persisted as SavedObjects. These non-persisted action tasks have a risk that they won't be run at all if the Kibana instance running them exits unexpectedly.

I'm happy with something pretty short here, and have an expanded section in Troubleshooting part of the doc - which we don't need to necessarily add in this PR (but open an issue to add it, if we don't).

pmuellr (Member) commented Jul 19, 2021

I mentioned in my review that it would be nice to somehow be able to tell ephemeral tasks from queued tasks in some contexts, and just thought of one:

  • if we want to measure task drift, we would probably want to do accounting on ephemeral tasks differently than queued tasks, since the scheduled task delay is going to be very small (maybe even 0?) for ephemeral tasks. For example, taking an average of task drift over some particular connector types (or all connectors) will end up averaging all the tasks together. That is still an interesting number, but it doesn't seem like it's completely accurate - we probably also want to see the task drift for JUST queued tasks. If we're using the event log to monitor this, as I did in [Alerting] Change execution of alerts from async to sync #97311 (comment), we'd need to know the task is ephemeral by looking at the event log docs, so we need an indication there.

Just noting that for follow-on work, since I have a feeling there may be other accounting things we'll want for these, after we've seen it running in practice ...
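As a sketch of that kind of follow-on accounting (the event document shape here, including the ephemeral flag, is hypothetical):

interface TaskRunEvent {
  taskType: string;
  ephemeral: boolean; // would require a new indicator on the event log doc
  scheduleDelayMs: number; // actual start time minus scheduled time
}

function averageDriftByPersistence(events: TaskRunEvent[]) {
  const buckets = { ephemeral: [] as number[], queued: [] as number[] };
  for (const event of events) {
    (event.ephemeral ? buckets.ephemeral : buckets.queued).push(event.scheduleDelayMs);
  }
  const average = (values: number[]) =>
    values.length === 0 ? 0 : values.reduce((sum, value) => sum + value, 0) / values.length;
  return {
    ephemeralAvgDriftMs: average(buckets.ephemeral),
    queuedAvgDriftMs: average(buckets.queued),
  };
}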

chrisronline (Contributor, Author) commented:

@pmuellr #97311 (comment) is perfect, I love it! Thanks!

@chrisronline chrisronline requested a review from ymao1 July 20, 2021 10:36
ymao1 (Contributor) left a comment:

LGTM! Great work :)

It looks like the docker allow-listing still needs to be added to this PR and I'm still curious about this comment from my last review (although not a blocker).

kibanamachine (Contributor) commented:

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id          | before | after | diff
actions     | 116    | 117   | +1
alerting    | 229    | 230   | +1
taskManager | 18     | 25    | +7
total       |        |       | +9

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id          | before | after | diff
taskManager | 7      | 8     | +1
Unknown metric groups

API count

id          | before | after | diff
actions     | 116    | 117   | +1
alerting    | 237    | 238   | +1
taskManager | 44     | 52    | +8
total       |        |       | +10

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @chrisronline

@chrisronline chrisronline merged commit 1f798aa into elastic:master Jul 20, 2021
@chrisronline chrisronline deleted the alerting/ephemeral_action_execution branch July 20, 2021 17:28
chrisronline added a commit to chrisronline/kibana that referenced this pull request Jul 20, 2021
* added ability to run ephemeral tasks

* fixed typing

* added typing on plugin

* WIP

* Fix type issues

* Hook up the ephemeral task into the task runner for actions

* Tasks can now run independently of one another

* Use deferred language

* Refactor taskParams slightly

* Use Promise.all

* Remove deferred logic

* Add config options to limit the amount of tasks executing at once

* Add ephemeral task monitoring

* WIP

* Add single test so far

* Ensure we log after actions have executed

* Remove confusing * 1

* Add logic to ensure we fallback to default enqueueing if the total actions is above the config

* Add additional test

* Fix tests a bit, ensure we log the alerting:actions-execute right away and the tests should listen for alerts:execute

* Better tests

* If the queue is at capacity, attempt to execute the ephemeral task as a regular action

* Ensure we run ephemeral tasks before to avoid them getting stuck in the queue

* Do not handle the promise anymore

* Remove unnecessary code

* Properly handle errors from ephemeral task lifecycle

* moved actions domain out of alerting and into actions plugin

* Remove some tests

* Fix TS and test issues

* Fix type issues

* Fix more type issues

* Fix more type issues

* Fix jest tests

* Fix more jest tests

* Off by default

* Fix jest tests

* Update config for this suite too

* Start of telemetry code

* Fix types and add missing files

* Fix telemetry schema

* Fix types

* Fix more types

* moved load event emission to pollingcycle and added health stats on Ephemeral tasks

* Add more telemetry data based on new health metrics for the ephemeral queue

* Fix tests and types

* Add separate request capacity for ephemeral queue

* Fix telemetry schema and add tests for usage collection

* track polled tasks by persistence and use in capacity estimation instead of executions

* fixed typing

* Bump default capacity

* added delay metric to ephemeral stats

* Fix bad merge

* Fix tests

* Fix tests

* Fix types

* Skip failing tests

* Exclude ephemeral stats from capacity estimation tests

* PR feedback

* More PR feedback

* PR feedback

* Fix merge conflict

* Try fixing CI

* Fix broken lock file from merge

* Match master

* Add this back

* PR feedback

* Change to queue and add test

* Disable ephemeral queue in tests

* Updated desc

* Comment out ephemeral-specific tests that require the entire test suite to support ephemeral tasks

* Add clarifying comment

Co-authored-by: Gidi Meir Morris <github@gidi.io>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
chrisronline added a commit that referenced this pull request Jul 20, 2021
…06298)

Development

Successfully merging this pull request may close these issues.

The execution of actions is often delayed after a rule schedules them due to capacity and persistence overhead
9 participants