Error handling for Alerting and Actions #39349
Pinging @elastic/kibana-stack-services
I think the "at most once" actions approach is too generic. For some error types it totally makes sense (the action is not configured properly and fails validation, for example). For other errors, I think it does not make sense, as the error could be transient in nature and the attempt might succeed if tried again later (can't connect to the SMTP server, things like that). Seems like we really want a way to flag something that should be retried (or, inversely, to flag something that should not be retried).
After seeing retries live, for actions that were failing because of bugs in my action type executor, I think we DEFINITELY want to - by default - not retry, and allow the executor to indicate explicitly that it should go into retry mode. E.g., if my executor is making an HTTP request (guessing this will be very common), you'd want 40x responses to usually not be retried, and 50x responses to probably be retried (transient error). By looking at the HTTP response, you can probably make a pretty good guess at whether to retry or not.

For rate-limiting, Slack sends a 429 response with a `Retry-After` header. Would of course be great to feed this info into the retry logic. In fact, the Slack action type should track this as well, but what would it do if it knows it's being rate limited and a request is made to fire a Slack action? Obviously, it shouldn't even bother to make the request to Slack right then; I guess it could set the time to run the task when it adds the task to task manager? We clearly are going to need a metric for "how many queued action tasks are in task manager" heh (unbounded queue).
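The status-based heuristic above could be sketched roughly as follows. This is an illustrative sketch, not Kibana's actual API: the `RetryDecision` shape and `decideRetry` name are hypothetical.

```typescript
// Hypothetical sketch: map an HTTP response to a retry decision, per the
// comment above. 4xx = caller error (don't retry), 5xx = transient (retry),
// 429 = rate-limited (retry, but not before the Retry-After deadline).
interface RetryDecision {
  retry: boolean;
  retryAt?: Date; // earliest time a retry makes sense, if known
}

function decideRetry(status: number, retryAfterSeconds?: number): RetryDecision {
  if (status === 429 && retryAfterSeconds !== undefined) {
    // Rate-limited: honor the server's Retry-After value.
    return { retry: true, retryAt: new Date(Date.now() + retryAfterSeconds * 1000) };
  }
  if (status >= 500) {
    // Server errors are likely transient, so retrying may succeed.
    return { retry: true };
  }
  // Other 4xx responses indicate a request problem; retrying won't help.
  return { retry: false };
}
```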
@pmuellr @bmcconaghy what initial type of retry logic should we do for actions? Example:
I'm thinking we should make …. Probably having some fixed "schedule" for retries available from the ActionType, with a hard-coded default, makes sense. Here's an npm package (I've never used it) that handles "backoffs" generically - https://www.npmjs.com/package/backoff - seems reasonable to me. But again, I would like to allow some kind of "don't bother trying again until time X" for the case of rate-limiting with a `Retry-After` header.

Guessing that "limits" like "stop after 1 hour" will be hard for humans to reason about; it should probably just be "stop after X retry attempts".
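A fixed backoff "schedule" with a "stop after X retry attempts" limit, in the spirit of the npm `backoff` package mentioned above, could look like this. The function name and default constants are illustrative, not values agreed on in this thread.

```typescript
// Sketch: exponential backoff with a cap on retry attempts.
// Returns the delay before the given attempt, or undefined to stop retrying.
function backoffDelayMs(attempt: number, baseMs = 1000, maxAttempts = 5): number | undefined {
  if (attempt < 1 || attempt > maxAttempts) {
    return undefined; // "stop after X retry attempts"
  }
  // Doubling schedule: 1s, 2s, 4s, 8s, 16s with the defaults above.
  return baseMs * 2 ** (attempt - 1);
}
```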
Made the following changes to the description:

Actions
+ There will be the option to define the max number of attempts
+ There will be the option to determine retry delay

Alerting
- A multiple of 3 will apply to the next scheduled execution when an alert fails. This will keep multiplying until an execution is successful
+ A multiple of 5 minutes will apply to the number of attempts when an alert fails. This will keep multiplying until an execution is successful
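The alerting back-pressure described in the change above (a multiple of 5 minutes applied to the number of failed attempts) can be sketched in a few lines. The function name is hypothetical.

```typescript
// Sketch of "a multiple of 5 minutes will apply to the number of attempts":
// delay grows linearly with consecutive failures (5m, 10m, 15m, ...).
const FIVE_MINUTES_MS = 5 * 60 * 1000;

function alertRetryDelayMs(failedAttempts: number): number {
  return FIVE_MINUTES_MS * failedAttempts;
}
```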
Should the action options - max # of attempts / delay length - be specified in action instances (when you create an action from an action type)? That seems right, since the retry characteristics are likely to be specific to the action type, and perhaps customizable per action instance. At least that way, it would cut down on the amount of config required for individual alerts.

Also, we're not accounting for the case where we "know" when we can retry, like with a 429 response with a Retry-After header. We could punt on that for now and look into supporting it later.

If we do end up adding this to Actions, I think we want to ignore it (probably) for the HTTP endpoint used to "fire" an action - those should never be retried, since we're expecting a "synchronous" response back - you don't want an HTTP request to send a Slack message to take 5 minutes to return to the client :-). Again, perhaps something to deal with in the future ...
For the first part, if I understand correctly, instead of defining ….

For the second part, action types can define a ….

For the third part, agreed :) And no code changes are required to make this work. The API call bypasses task manager, so all the logic we're discussing here would be ignored.
As discussed with @pmuellr, the action types will not be able to define …. The ….

We will wait for feature requests before allowing alert types to define their own retry logic. It isn't as needed on that side, since alerts run on intervals and can use the built-in back-pressure, until we have a scenario that needs to customize it.
Latest changes:

Actions
- There will be the option to determine retry delay
+ Retry logic will follow the suggestion in the executor result

Alerting
- In the future, alert types will be able to define their own back-pressure formula
- We will cap back-pressure to 1 day max
- In the future, alert types will be able to define their back-pressure cap
- In the future, errors will be able to opt out of retrying and simply fail the execution until the next regular interval
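Putting the two ideas in this thread together - "retry logic will follow the suggestion in the executor result" and the 5-minute-multiple back-pressure capped at 1 day - a scheduling helper might look like this. The `ExecutorResult` shape and `nextRetryAt` name are illustrative assumptions, not Kibana's actual types.

```typescript
// Sketch: the executor result suggests whether/when to retry.
//   retry: false      -> opt out, don't retry
//   retry: true       -> use default back-pressure (5m * attempts, capped at 1 day)
//   retry: Date       -> retry at that exact time (e.g. from a Retry-After header)
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

interface ExecutorResult {
  status: 'ok' | 'error';
  retry?: boolean | Date;
}

function nextRetryAt(result: ExecutorResult, attempts: number, now: Date): Date | undefined {
  if (result.status === 'ok' || !result.retry) {
    return undefined; // success, or the executor opted out of retrying
  }
  if (result.retry instanceof Date) {
    return result.retry; // executor knows exactly when to retry
  }
  const delayMs = Math.min(5 * 60 * 1000 * attempts, ONE_DAY_MS);
  return new Date(now.getTime() + delayMs);
}
```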
@pmuellr @bmcconaghy as per our discussion, latest changes:

Alerting
- Back-pressure will be in place when alerts fail to execute
- A multiple of 5 minutes will apply to the number of attempts when an alert fails. This will keep multiplying until an execution is successful
+ When an alert fails execution, no retry logic is applied; the next execution is at the regular next interval
Removed providing … from the description:

Alerting
- Executors will be provided the number of times the execution has been failing
Closing, as error handling is now implemented and the meta alerts issue has been created: #49410.
…rring tasks (#83682) This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349). In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
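The fix described in this commit - schedule a recurring task's retry at its next scheduled run instead of `now + timeout` - can be sketched as follows. The function and parameter names are illustrative, not Task Manager's actual code.

```typescript
// Sketch of the #83682 fix: for recurring tasks, retryAt follows the
// recurring schedule; only one-off tasks fall back to the task type's timeout.
function computeRetryAt(
  lastRunAt: Date,
  recurringIntervalMs: number | undefined, // defined only for recurring tasks
  timeoutMs: number
): Date {
  if (recurringIntervalMs !== undefined) {
    // Recurring task: retry on its next scheduled run.
    return new Date(lastRunAt.getTime() + recurringIntervalMs);
  }
  // One-off task: fall back to the task definition's timeout.
  return new Date(lastRunAt.getTime() + timeoutMs);
}
```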
…rring tasks (#83682) (#83800) This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349). In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
After a design discussion with @bmcconaghy, @epixa, @peterschretlen, @pmuellr and @mikecote, I'm writing down the outcome of what we think error handling should be for Alerting and Actions.
Actions
Alerting