Error handling for Alerting and Actions #39349

Closed

mikecote opened this issue Jun 20, 2019 · 13 comments

@mikecote
Contributor

mikecote commented Jun 20, 2019

After a design discussion with @bmcconaghy, @epixa, @peterschretlen, @pmuellr and @mikecote, I'm writing down the outcome of what we think error handling should be for Alerting and Actions.

Actions

  • We will try at most once to execute an action
  • There will be the option to define the max number of attempts
  • Retry logic will follow the suggestion from the executor result
  • The failures will be logged into history
  • Alerts can be built on top of history to notify when actions are failing

Alerting

  • When an alert fails execution, no retry logic is applied; the next execution is at the regular next interval
  • Executors will be provided the date and time of the last successful execution (see the sketch after these lists)
  • The history will keep track of how long alert types take to execute
  • Meta alert types to be created for alerting on errors
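
To illustrate the bullet above about the date and time of the last successful execution, here is a minimal sketch of an alert type executor using that value to query only the window since the last good run. The parameter name and shapes are illustrative assumptions, not the final API:

```ts
// Sketch of an alert type executor given the time of the last successful execution.
// `previousStartedAt` is an illustrative name for that value; assume it is null on the first run.
async function alertExecutor({ previousStartedAt }: { previousStartedAt: Date | null }) {
  // Fall back to a 5-minute window on the very first execution.
  const from = previousStartedAt ?? new Date(Date.now() - 5 * 60 * 1000);
  const to = new Date();
  // Query the data between `from` and `to` here, then schedule actions as needed.
  return { from: from.toISOString(), to: to.toISOString() };
}
```
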
@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services

@bmcconaghy
Contributor

I think the "at most once" actions approach is too generic. For some error types, it totally makes sense (action is not configured properly and fails validation, for example). For other errors, I think it does not make sense as the error could be transient in nature and the attempt might succeed if tried again later (can't connect to SMTP server, things like that). Seems like we really want a way to flag something that should be retried (or inversely, to flag something that should not be retried).

@pmuellr
Member

pmuellr commented Jun 20, 2019

> Seems like we really want a way to flag something that should be retried (or inversely, to flag something that should not be retried).

After seeing retries happen live for actions that were failing because of bugs in my action type executor, I think we DEFINITELY want to not retry by default, and allow the executor to explicitly indicate that it should go into retry mode.

E.g., if my executor is making an HTTP request (guessing this will be very common), you'd usually want 4xx responses not to be retried, and 5xx responses to probably be retried (transient errors). By looking at the HTTP response, you can probably make a pretty good guess at whether to retry or not.

For rate limiting, Slack sends a 429 response with a Retry-After: <seconds> header, and maybe that header appears with other "error" statuses as well? 503? See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After
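
A minimal sketch of that status-based decision (a hypothetical helper, not part of the actions plugin): 4xx means don't retry, 5xx means retry, and a Retry-After header yields a concrete retry time:

```ts
// Hypothetical helper: decide whether/when to retry from an HTTP response.
// Returns false (don't retry), true (retry on the default schedule), or a Date (retry then).
function retryFromResponse(status: number, retryAfterHeader?: string): boolean | Date {
  if (retryAfterHeader !== undefined) {
    // Retry-After can be a number of seconds or an HTTP date.
    const seconds = Number(retryAfterHeader);
    return Number.isNaN(seconds)
      ? new Date(retryAfterHeader)
      : new Date(Date.now() + seconds * 1000);
  }
  if (status >= 500) return true;  // transient server error: worth retrying
  if (status >= 400) return false; // client error: retrying won't help
  return false;                    // success/redirect: nothing to retry
}
```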

It would of course be great to feed this info into the retry logic. In fact, the Slack action type should track this as well, but what would it do if it knows it's being rate limited and a request is made to fire a Slack action? Obviously, it shouldn't even bother to make the request to Slack right then; I guess it could set the time to run the task when it adds the task to task manager? We're clearly going to need a metric for "how many queued action tasks are in task manager", heh (unbounded queue).

@mikecote
Contributor Author

@pmuellr @bmcconaghy what initial type of retry logic should we use for actions?

Example:

  • Try again in 15 seconds and then in 30 seconds?
  • Try again in 15 seconds and apply back-pressure with a multiple of 3? Stop after 1 hour?

@pmuellr
Member

pmuellr commented Jun 24, 2019

I'm thinking we should make ActionTypes provide the retry info. E.g., an ActionType that does an HTTP request can fairly easily determine whether a retry is even warranted (usually 4xx == do not retry), and would have a better sense of when to retry. In the case of 429 responses with a Retry-After header, the action would even know WHEN to do the next retry.

Probably having some fixed "schedule" for retries available from the ActionType, with a hard-coded default, makes sense. Here's an npm package (I've never used it) that handles "backoffs" generically - https://www.npmjs.com/package/backoff - it seems reasonable to me. But again, I would like to allow some kind of "don't bother trying again until time X" for the case of rate limiting with a Retry-After.

I'm guessing that "limits" like "stop after 1 hour" will be hard for humans to reason about; it should probably just be "stop after X retry attempts".
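
For example, a hard-coded default schedule capped by attempt count (illustrative only; the backoff package above covers the same ground more generically) could look like:

```ts
// Illustrative fixed retry schedule, capped by number of attempts rather than wall-clock time.
const RETRY_DELAYS_SECONDS = [15, 30, 60, 300];

// Returns the delay (in seconds) before the next retry, or undefined once attempts are exhausted.
function nextRetryDelay(attempt: number): number | undefined {
  return attempt < RETRY_DELAYS_SECONDS.length ? RETRY_DELAYS_SECONDS[attempt] : undefined;
}
```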

@mikecote
Contributor Author

Made the following changes to the description:

Actions

+ There will be the option to define the max number of attempts
+ There will be the option to determine retry delay

Alerting

- A multiple of 3 will apply to the next scheduled execution when an alert fails. This will keep multiplying until an execution is successful
+ A multiple of 5 minutes will apply to the number of attempts when an alert fails. This will keep multiplying until an execution is successful

@pmuellr
Member

pmuellr commented Jul 25, 2019

Should the action options - max # of attempts / delay length - be specified in action instances (when you create an action from an action type)? That seems right, since the retry characteristics are likely to be specific to the action type, and perhaps customizable per action instance. At least that way, it would cut down on the amount of config required for individual alerts.

Also, we're not accounting for the case where we "know" when we can retry, like with a 429 response with a Retry-After header. We could punt on that for now and look into supporting it later.

If we do end up adding this to Actions, I think we want to ignore it (probably) for the HTTP endpoint used to "fire" an action - those should never be retried, since we're expecting a "synchronous" response back - you don't want an HTTP request that sends a Slack message to take 5 minutes to return to the client :-) . Again, perhaps something to deal with in the future ...

@mikecote
Contributor Author

For the first part, if I understand correctly, instead of defining maxAttempts with the action type (ex: .slack, maxAttempts: 5), we should define maxAttempts with the alert actions? (Ex: threshold alert, do .slack action, maxAttempts: 5. Do .email action, maxAttempts: 3)?

For the second part, action types can define a getRetryDelay(attempts, error) { ... } and return a number in seconds that task manager should wait before trying the action again. It could look for the header and return a number accordingly. The error object would be the error thrown during an action executor call.
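
A sketch of what such a getRetryDelay could look like, assuming the thrown error carries the parsed Retry-After value (the error shape here is a hypothetical assumption, not an existing type):

```ts
// Hypothetical error shape: the executor attaches the parsed Retry-After value when it throws.
interface ActionExecutionError extends Error {
  retryAfterSeconds?: number;
}

// Sketch of a per-action-type retry delay (in seconds): honour Retry-After when present,
// otherwise double a 15-second base delay on each attempt.
function getRetryDelay(attempts: number, error: ActionExecutionError): number {
  if (error.retryAfterSeconds !== undefined) {
    return error.retryAfterSeconds;
  }
  return 15 * 2 ** (attempts - 1);
}
```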

For the third part, agreed :) No code changes are required to make this work; the API call bypasses task manager, so all the logic we're discussing here would be ignored.

@mikecote
Contributor Author

As discussed with @pmuellr, the `getRetryDelay` function in task manager will change to `getRetry`, which can return `true | false | Date`.

Action types will not be able to define a `getRetry` function, as we will define a single one within actions. The reason the actions plugin will handle `getRetry` is that it will handle the executor's returned result with it, which means action type executors will have to return a failure instead of throwing one in order to take advantage of the custom retry logic.
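
A rough sketch of how that single getRetry inside the actions plugin could map an executor's returned failure onto a task manager decision (the result shape is hypothetical, not the final API):

```ts
// Hypothetical result returned (not thrown) by an action type executor.
interface ExecutorResult {
  status: 'ok' | 'error';
  retry?: boolean | Date; // the executor's suggestion: retry or not, or retry at a given time
}

// Single getRetry owned by the actions plugin: returns true (retry on schedule),
// false (give up), or a Date (retry at that time), defaulting to no retry.
function getRetry(attempt: number, result: ExecutorResult): boolean | Date {
  if (result.status === 'ok') return false; // nothing to retry
  return result.retry ?? false;
}
```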

The maxAttempts will still live with the action type instead of the alert actions.

We will wait for feature requests before allowing alert types to define their own retry logic. It isn't as needed on that side, since alerts run on intervals and can use the built-in back-pressure until we have a scenario that needs to customize it.

@mikecote
Contributor Author

Latest changes:

Actions

- There will be the option to determine retry delay
+ Retry logic will follow the suggestion from the executor result

Alerting

- In the future, alert types will be able to define their own back-pressure formula
- We will cap back-pressure to 1 day max
- In the future, alert types will be able to define their back-pressure cap
- In the future, errors will be able to opt-out from retrying and simply fail the execution until the next regular interval

@mikecote
Contributor Author

@pmuellr @bmcconaghy as per our discussion, latest changes:

Alerting

- Back-pressure will be in place when alerts fail to execute
- A multiple of 5 minutes will apply to the number of attempts when an alert fails. This will keep multiplying until an execution is successful
+ When an alert fails execution, no retry logic is applied; the next execution is at the regular next interval

@mikecote
Contributor Author

Removed providing attempts to alert executors until the feature is requested:

Alerting

- Executors will be provided the number of times the execution has been failing

@mikecote
Contributor Author

Closing, as error handling is now implemented and the meta alerts issue has been created: #49410.

gmmorris added a commit that referenced this issue Nov 19, 2020
…rring tasks (#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
gmmorris added a commit to gmmorris/kibana that referenced this issue Nov 19, 2020
…rring tasks (elastic#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of elastic#39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
gmmorris added a commit that referenced this issue Nov 19, 2020
…rring tasks (#83682) (#83800)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
chrisronline pushed a commit to chrisronline/kibana that referenced this issue Nov 19, 2020
…rring tasks (elastic#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of elastic#39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.