Error handling for Alerting and Actions #39349

Closed

mikecote opened this issue Jun 20, 2019 · 13 comments

@mikecote
Contributor

mikecote commented Jun 20, 2019

After a design discussion with @bmcconaghy, @epixa, @peterschretlen, @pmuellr and @mikecote, I'm writing down the outcome of what we think error handling should be for Alerting and Actions.

Actions

  • We will try at most once to execute an action
  • There will be the option to define the max number of attempts
  • Retry logic will follow the suggestion from the executor result
  • The failures will be logged into history
  • Alerts can be built on top of history to notify when actions are failing

Alerting

  • When an alert fails execution, no retry logic is applied; the next execution is at the regular next interval
  • Executors will be provided the date and time of the last successful execution (see the sketch after these lists)
  • The history will keep track of how long alert types take to execute
  • Meta alert types to be created for alerting on errors
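
To illustrate the bullet above about the date and time of the last successful execution, here is a minimal sketch of an alert type executor using that value to query only the window since the last good run. The parameter name and shapes are illustrative assumptions, not the final API:

```ts
// Sketch of an alert type executor given the time of the last successful execution.
// `previousStartedAt` is an illustrative name for that value; assume it is null on the first run.
async function alertExecutor({ previousStartedAt }: { previousStartedAt: Date | null }) {
  // Fall back to a 5-minute window on the very first execution.
  const from = previousStartedAt ?? new Date(Date.now() - 5 * 60 * 1000);
  const to = new Date();
  // Query the data between `from` and `to` here, then schedule actions as needed.
  return { from: from.toISOString(), to: to.toISOString() };
}
```
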
@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services

@bmcconaghy
Contributor

I think the "at most once" actions approach is too generic. For some error types, it totally makes sense (action is not configured properly and fails validation, for example). For other errors, I think it does not make sense as the error could be transient in nature and the attempt might succeed if tried again later (can't connect to SMTP server, things like that). Seems like we really want a way to flag something that should be retried (or inversely, to flag something that should not be retried).

@pmuellr
Member

pmuellr commented Jun 20, 2019

> Seems like we really want a way to flag something that should be retried (or inversely, to flag something that should not be retried).

After seeing retries happen live for actions that were failing because of bugs in my action type executor, I think we DEFINITELY want to not retry by default, and allow the executor to explicitly indicate that it should go into retry mode.

E.g., if my executor is making an HTTP request (guessing this will be very common), you'd usually want 4xx responses not to be retried, and 5xx responses to probably be retried (transient errors). By looking at the HTTP response, you can probably make a pretty good guess at whether to retry or not.

For rate limiting, Slack sends a 429 response with a Retry-After: <seconds> header, and maybe that header appears with other "error" statuses as well? 503? See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After
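
A minimal sketch of that status-based decision (a hypothetical helper, not part of the actions plugin): 4xx means don't retry, 5xx means retry, and a Retry-After header yields a concrete retry time:

```ts
// Hypothetical helper: decide whether/when to retry from an HTTP response.
// Returns false (don't retry), true (retry on the default schedule), or a Date (retry then).
function retryFromResponse(status: number, retryAfterHeader?: string): boolean | Date {
  if (retryAfterHeader !== undefined) {
    // Retry-After can be a number of seconds or an HTTP date.
    const seconds = Number(retryAfterHeader);
    return Number.isNaN(seconds)
      ? new Date(retryAfterHeader)
      : new Date(Date.now() + seconds * 1000);
  }
  if (status >= 500) return true;  // transient server error: worth retrying
  if (status >= 400) return false; // client error: retrying won't help
  return false;                    // success/redirect: nothing to retry
}
```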

It would of course be great to feed this info into the retry logic. In fact, the Slack action type should track this as well, but what would it do if it knows it's being rate limited and a request is made to fire a Slack action? Obviously, it shouldn't even bother to make the request to Slack right then; I guess it could set the time to run the task when it adds the task to task manager? We're clearly going to need a metric for "how many queued action tasks are in task manager", heh (unbounded queue).

@mikecote
Contributor Author

@pmuellr @bmcconaghy what initial type of retry logic should we use for actions?

Example:

  • Try again in 15 seconds and then in 30 seconds?
  • Try again in 15 seconds and apply back-pressure with a multiple of 3? Stop after 1 hour?

@pmuellr
Member

pmuellr commented Jun 24, 2019

I'm thinking we should make ActionTypes provide the retry info. E.g., an ActionType that does an HTTP request can fairly easily determine whether a retry is even warranted (usually 4xx == do not retry), and would have a better sense of when to retry. In the case of 429 responses with a Retry-After header, the action would even know WHEN to do the next retry.

Probably having some fixed "schedule" for retries available from the ActionType, with a hard-coded default, makes sense. Here's an npm package (I've never used it) that handles "backoffs" generically - https://www.npmjs.com/package/backoff - it seems reasonable to me. But again, I would like to allow some kind of "don't bother trying again until time X" for the case of rate limiting with a Retry-After.

I'm guessing that "limits" like "stop after 1 hour" will be hard for humans to reason about; it should probably just be "stop after X retry attempts".
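
For example, a hard-coded default schedule capped by attempt count (illustrative only; the backoff package above covers the same ground more generically) could look like:

```ts
// Illustrative fixed retry schedule, capped by number of attempts rather than wall-clock time.
const RETRY_DELAYS_SECONDS = [15, 30, 60, 300];

// Returns the delay (in seconds) before the next retry, or undefined once attempts are exhausted.
function nextRetryDelay(attempt: number): number | undefined {
  return attempt < RETRY_DELAYS_SECONDS.length ? RETRY_DELAYS_SECONDS[attempt] : undefined;
}
```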

@mikecote
Contributor Author

Made the following changes to the description:

Actions

+ There will be the option to define the max number of attempts
+ There will be the option to determine retry delay

Alerting

- A multiple of 3 will apply to the next scheduled execution when an alert fails. This will keep multiplying until an execution is successful
+ A multiple of 5 minutes will apply to the number of attempts when an alert fails. This will keep multiplying until an execution is successful

@pmuellr
Member

pmuellr commented Jul 25, 2019

Should the action options - max # of attempts / delay length - be specified in action instances (when you create an action from an action type)? That seems right, since the retry characteristics are likely to be specific to the action type, and perhaps customizable per action instance. At least that way, it would cut down on the amount of config required for individual alerts.

Also, we're not accounting for the case where we "know" when we can retry, like with a 429 response with a Retry-After header. We could punt on that for now and look into supporting it later.

If we do end up adding this to Actions, I think we want to ignore it (probably) for the HTTP endpoint used to "fire" an action - those should never be retried, since we're expecting a "synchronous" response back - you don't want an HTTP request that sends a Slack message to take 5 minutes to return to the client :-) . Again, perhaps something to deal with in the future ...

@mikecote
Contributor Author

For the first part, if I understand correctly, instead of defining maxAttempts with the action type (ex: .slack, maxAttempts: 5), we should define maxAttempts with the alert actions? (Ex: threshold alert, do .slack action, maxAttempts: 5. Do .email action, maxAttempts: 3)?

For the second part, action types can define a getRetryDelay(attempts, error) { ... } and return a number in seconds that task manager should wait before trying the action again. It could look for the header and return a number accordingly. The error object would be the error thrown during an action executor call.
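
A sketch of what such a getRetryDelay could look like, assuming the thrown error carries the parsed Retry-After value (the error shape here is a hypothetical assumption, not an existing type):

```ts
// Hypothetical error shape: the executor attaches the parsed Retry-After value when it throws.
interface ActionExecutionError extends Error {
  retryAfterSeconds?: number;
}

// Sketch of a per-action-type retry delay (in seconds): honour Retry-After when present,
// otherwise double a 15-second base delay on each attempt.
function getRetryDelay(attempts: number, error: ActionExecutionError): number {
  if (error.retryAfterSeconds !== undefined) {
    return error.retryAfterSeconds;
  }
  return 15 * 2 ** (attempts - 1);
}
```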

For the third part, agreed :) No code changes are required to make this work; the API call bypasses task manager, so all the logic we're discussing here would be ignored.

@mikecote
Contributor Author

As discussed with @pmuellr, the `getRetryDelay` function in task manager will change to `getRetry`, which can return `true | false | Date`.

Action types will not be able to define a `getRetry` function, as we will define a single one within actions. The reason the actions plugin will handle `getRetry` is that it will handle the executor's returned result with it, which means action type executors will have to return a failure instead of throwing one in order to take advantage of the custom retry logic.
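
A rough sketch of how that single getRetry inside the actions plugin could map an executor's returned failure onto a task manager decision (the result shape is hypothetical, not the final API):

```ts
// Hypothetical result returned (not thrown) by an action type executor.
interface ExecutorResult {
  status: 'ok' | 'error';
  retry?: boolean | Date; // the executor's suggestion: retry or not, or retry at a given time
}

// Single getRetry owned by the actions plugin: returns true (retry on schedule),
// false (give up), or a Date (retry at that time), defaulting to no retry.
function getRetry(attempt: number, result: ExecutorResult): boolean | Date {
  if (result.status === 'ok') return false; // nothing to retry
  return result.retry ?? false;
}
```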

The maxAttempts will still live with the action type instead of the alert actions.

We will wait for feature requests before allowing alert types to define their own retry logic. It isn't as needed on that side, since alerts run on intervals and can use the built-in back-pressure until we have a scenario that needs to customize it.

@mikecote
Contributor Author

Latest changes:

Actions

- There will be the option to determine retry delay
+ Retry logic will follow the suggestion from the executor result

Alerting

- In the future, alert types will be able to define their own back-pressure formula
- We will cap back-pressure to 1 day max
- In the future, alert types will be able to define their back-pressure cap
- In the future, errors will be able to opt-out from retrying and simply fail the execution until the next regular interval

@mikecote
Contributor Author

@pmuellr @bmcconaghy as per our discussion, latest changes:

Alerting

- Back-pressure will be in place when alerts fail to execute
- A multiple of 5 minutes will apply to the number of attempts when an alert fails. This will keep multiplying until an execution is successful
+ When an alert fails execution, no retry logic is applied; the next execution is at the regular next interval

@mikecote
Contributor Author

Removed providing attempts to alert executors until the feature is requested:

Alerting

- Executors will be provided the number of times the execution has been failing

@mikecote
Contributor Author

Closing, as error handling is now implemented and the meta alerts issue has been created: #49410.

gmmorris added a commit that referenced this issue Nov 19, 2020
…rring tasks (#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
gmmorris added a commit to gmmorris/kibana that referenced this issue Nov 19, 2020
…rring tasks (elastic#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of elastic#39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
gmmorris added a commit that referenced this issue Nov 19, 2020
…rring tasks (#83682) (#83800)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
chrisronline pushed a commit to chrisronline/kibana that referenced this issue Nov 19, 2020
…rring tasks (elastic#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of elastic#39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.