aws - error handling & concurrency pattern in actions/filters revisit #3531
thanks for reporting, this seems to be problematic in a few resources. in a nutshell, an early concurrency and error handling pattern in the code was to use a concurrent.futures threadpool, where errors are propagated sans stack trace and, it appears, only logged. the more modern pattern is to raise from the action/filter at the end, before exit, if an unexpected exception occurs.
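The contrast between the two patterns can be sketched roughly like this. This is illustrative only, not Custodian's actual code; `process` stands in for a per-resource API call:

```python
# Sketch contrasting the two error-handling patterns described above:
# the early pattern logs worker errors and moves on, while the newer
# pattern collects them and raises before the action exits.
from concurrent.futures import ThreadPoolExecutor, as_completed
import logging

log = logging.getLogger("custodian.sketch")

def process(resource):
    # stand-in for a per-resource API call that may fail
    if resource.get("bad"):
        raise ValueError("api error on %s" % resource["id"])
    return resource["id"]

def run_old(resources):
    # early pattern: errors are logged, never re-raised
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(process, r) for r in resources]
        for f in as_completed(futures):
            if f.exception():
                log.warning("error: %s", f.exception())

def run_new(resources):
    # modern pattern: remember errors, raise after the batch completes
    errors = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(process, r) for r in resources]
        for f in as_completed(futures):
            if f.exception():
                errors.append(f.exception())
    if errors:
        raise errors[0]
```

With the old pattern a caller has no way to tell the batch partially failed; with the new one, the policy run errors out visibly after all items have been attempted.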
@thisisshi ping, curious re your thoughts on this. I do think we're going to need some nuance, and documentation at least in developer docs (if not end user docs) on how custodian holistically deals with errors. the problem i'd like to address is that it's not holistic at the moment from a pattern perspective.
Not sure if this belongs in this issue or a new feature request, but curious if there's an established pattern for handling failures in actions. If there were a way to mark an action as always being called (and maybe pass along the previous action's results so it could detect failures), or to call it only upon failure of the normal actions, then you could include some sort of error-handling logic... even if that logic were just "notify someone upon failure".
there's a specific metric produced on policy execution errors |
there's a lot of nuance that needs to be accounted for on error handling. ie. say a policy is executing against ec2 and terminates 80 out of 100 instances. we'd still want to process later actions like notify on the 80, else we'd be taking action on the environment and user resources without user notification. i think that was part of the origination of this behavior in some of the modules that operate against ec2. also quoting from @tjstansell's message on gitter, which has another scenario that needs consideration as well.
at the moment the general error handling principles for the codebase are
the other wrinkle for error handling is that we use bulk apis wherever possible to minimize the number of api calls. those all behave somewhat differently wrt errors. some raise an error on a single item in the set, where we have to parse the error message to retry with the set remainder; some return failed items in the response (not raised as an exception) and we track those separately. one option is that we start tracking and filtering successful resources for each policy action to pass down, and track policy resource errors as a separate metric, with a separate resource-errors.json blob store output.
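As an illustration of the second shape (failed items reported in the response body rather than raised), handling might look like this. `bulk_call` is a hypothetical stand-in, not a real boto3 method:

```python
def bulk_call(ids):
    # fake bulk API: reports per-item failures in the response,
    # the way some AWS batch endpoints do, instead of raising
    failed = [{"Id": i, "Code": "NotFound"} for i in ids if i.startswith("bad-")]
    ok = [i for i in ids if not i.startswith("bad-")]
    return {"Successful": ok, "Failed": failed}

def process_bulk(ids):
    # track failed items separately instead of treating the
    # call as all-or-nothing
    resp = bulk_call(ids)
    errors = {f["Id"]: f["Code"] for f in resp["Failed"]}
    return resp["Successful"], errors
```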
Would it make sense in this case then to only pass the successfully processed resources, in the same vein as how filters currently work? e.g. all EBS volumes that are snapshotted then get passed to the delete action. (This would definitely require a lot of work updating all the actions to properly return the correct resources, but it also solves the immediate problem.) However, this is definitely a breaking change from the current behavior; fundamentally I think the question is whether actions are to be applied to all resources, or serially like how we currently process filters. Maybe the solution would be to introduce a new …

```yaml
policies:
  - name: delete-vol-after-snapshot
    resource: ebs
    actions:
      - type: snapshot
      - type: delete
        depends-on: snapshot  # only delete resources returned by the snapshot action
```

Unfortunately this brings in its own set of problems: action types are not unique, we would have to maintain a list of resources from each action's return in order to properly pass them into each, and we would have to add a depends-on attribute to all action types. (Have some more thoughts, but just noticed that @kapilt just posted, might as well put this out there for now.)
error cases:
in either case we may need to expose an additional metric wrt action failure if we're no longer applying all actions to all resources
logging on expected errors seems extraneous. the biggest concern here is the amount of churn we have to get to this. definitely don't want depends-on syntax, that's forcing too much complexity onto end users. there's a compatibility concern for reinterpreting action return values as filtered resource sets; currently action returns are serialized json blob store outputs. i'd want to review those, but offhand that seems okay, afaicr there's only a handful. if we wanted to preserve them, it means using an out-of-band mechanism for noting resource exceptions. there shouldn't be any automated retry at this layer re bulk resource actions, that's for the individual action to determine.
yea agreed, the more I thought about it the worse it got: dependency cycle checking, updating the interface, etc. To preserve the current behavior and simplify the dependency mapping between actions, we could introduce a nested action (maybe also limiting it to one level)? precedence for this would be the …

```yaml
policies:
  - name: nested-actions
    resource: ebs
    actions:
      - type: snapshot
        actions:
          - type: delete
          - type: notify
    ...
```
i'm not sure why you're proposing new syntax, either the nesting or depends-on; what problem are you trying to address by doing that?
nesting would preserve current behavior (applying all actions to all resources). are you proposing that the default behavior be changed to passing the return set of each action into the next action?
i'm saying the default behavior you're referencing for compatibility is actually an implementation detail, not an interface guarantee. the common expectation has always been that things would stop on error, so there isn't a compatibility concern here. this behavior change is either to continue with only non-errored items through the rest of the list before erroring out on policy execution, OR to have the action error out then and there to stop policy execution. introducing new syntax feels like complexity in an attempt to preserve unspecified semantics.
Ah, gotcha, sounds good to me. Thanks for the clarification 👍
Perhaps we should let each policy define its own error handling behavior. It seems that each policy has its own dependencies in that regard. If I have a policy like I've suggested, with actions of:
What I'd like is to have the ability for the … Or maybe it could support some sort of …
I was also brainstorming the idea of supporting something like a `try`/`except`/`finally` syntax to define actions:
but I'm not sure there's really a case for an … I guess in my head the … Anyway, that's a long ramble, but as a consumer of this I wanted to throw my two cents in. Thanks!
It seems there should be a good and consistent way to prune non-existent resources from the list as well ... i've run into situations where an EBS volume will get deleted between the time the resources are queried and when custodian goes to tag the resources. Unfortunately, this results in the entire tag effort failing for all resources. It would be ideal if it could detect that these resources no longer exist and remove them from the set of resources it's working on. The same goes for having multiple policies that might act on resources at the same time... as changes are made to resources, policies that run later end up using the cached resource data instead which doesn't reflect the current state of the resource (like tag changes, if an instance has been deleted, etc). This is similar to what is reported in #2892 with the boto error being:
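The prune-and-retry behavior suggested here (drop a vanished resource and continue with the rest) could look roughly like this. `NotFound` and `fake_tag` are hypothetical stand-ins for the real client error and tagging API call:

```python
class NotFound(Exception):
    """Stand-in for a client error that identifies the missing resource."""
    def __init__(self, resource_id):
        super().__init__(resource_id)
        self.resource_id = resource_id

def tag_resources(tag_call, resources):
    # retry the bulk tag call, pruning any resource reported as gone,
    # so one deleted volume doesn't fail the whole tag effort
    remainder = list(resources)
    while remainder:
        try:
            tag_call([r["id"] for r in remainder])
            return remainder
        except NotFound as e:
            remainder = [r for r in remainder if r["id"] != e.resource_id]
    return remainder
```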
|
i was also thinking about adding functionality regarding exception handling. For example, the max-resource configuration raises an exception, so it errors out when triggered, but i might want to trigger a notify action instead. so, similar to @tjstansell
|
@anovis this issue is about something unrelated imo. if you want notifies of policy errors, they are already published as a metric, and alerts can be set on the metrics for notifies.
@tjstansell we really, really don't want to introduce programming-in-yaml constructs. Also, it doesn't map well to any of the implementation details of the issue at hand, which is handling failed resources. The proposal here is something different that gets to the same end goal: showing which resources failed at which stage when dealing with partial success, via a separate json file for them in blob store, and it does that without introducing new configuration syntax.
I'm not sure I understand the proposal you're talking about that handles the situations I've mentioned. Reading through the comments here again, everything feels like a brainstorming session (including my own suggestions). If there's a proposal that lets me define how to handle errors, great. It's not clear to me yet, though, what that proposal is. |
"one option is that we start tracking and filtering successful resources for each policy action to pass down, and track policy resource errors as a separate metric, with a separate resource-errors.json blob store output." in a nutshell, it's not safe to do anything with failed resources; we don't know why they failed. so we remove them from the set, store them, and process the remainder. adding a bunch of configuration here is problematic imo, it's just adding complexity that the user doesn't want, or in many cases even knows what to do with.
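A minimal sketch of that flow: failed resources are pulled from the working set after each action, recorded, and the remainder continues. The resource-errors.json name comes from the quote above; everything else (the action signature, the error record shape) is illustrative:

```python
import json

def run_actions(resources, actions, error_path="resource-errors.json"):
    # each action returns (succeeded, failed); failed resources are
    # recorded and removed from the set passed to the next action
    errors = []
    for name, action in actions:
        ok, failed = action(resources)
        errors.extend({"action": name, "resource": r} for r in failed)
        resources = ok
    # persist the error records alongside the run's other blob outputs
    with open(error_path, "w") as fh:
        json.dump(errors, fh)
    return resources, errors
```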
What about the case where you have multiple actions ... there should be a way to at least notify folks of the actions that successfully happened to resources, even if not all actions were successful. Are you proposing that we need some way to monitor the metrics to detect when something failed, then somehow determine which resources failed (which relies on digging through cloudwatch logs from lambdas), and create a notification from that? I'm not even sure how someone would do that.

I mean, moving to a place where we know a resource will move on to the next step only if it succeeds is great (which is basically how I think it's supposed to be working today), but I think many of us are hoping to have more control than that. I was really hoping that at the end of each action there was a way to trigger a specific action (probably just …). I'm guessing that most of us just want a rich notification that accurately represents what happened, rather than wanting to try to do any other specific actions upon failure.

Perhaps chaining policies is one way to achieve this, but the current limitations of how caching works would prevent you from doing both actions in a single run. For instance: a policy to take a backup of a dynamodb table, then a second policy that deletes that table, but filtered only on tables that have had a backup within X hours (not that that filter exists yet, afaik). You'd have to run custodian with just the first policy, then again with the second, to keep the original resource state that's cached from being used in the second policy.
cloudwatch metrics for exceptions just track the number of occurrences and provide no additional details, so you would always be forced to go to the logs, even for exceptions that you don't want to worry about. being able to notify and pass the exception along to sns or sqs would enable an easier workflow: for example, you can send the exceptions to an sqs queue and parse them to route one type of exception to one team and other exceptions to another team. another thing to consider is that as we move towards more multi-cloud support, we wouldn't want users to have 3 different solutions to monitor and create custom notify actions from the exception metrics, depending on whether they output to cloudwatch or the gcp and azure equivalents.
|
@tjstansell blob storage aka -s is the system of record for a custodian run/execution and contains logs, metrics, resources, metadata, etc. i'm proposing we extend that with an additional blob for resources that failed any filter/action, along with metadata about that action/error. if you see a policy metric error, you go check the resource error blob and execution logs for the execution, which are sitting in blob storage (or alternatively index them directly for alerts).

@anovis re multi-provider, the intent is that we'll allow them to mix and match on outputs from providers. imo, adding 5-10 lines of config per policy as boilerplate is not a good user experience. the reality is that on error you're always going to want to look at the logs and outputs for more context. maybe we'll get to notify, but frankly if we don't have the first step, which is a large undertaking (ie. actually recording what error'd, where, why), it's moot imo anyway, since you're not notifying on anything relating to what resources were actually processed or error'd, or at what step; at which point alerting on the metrics and then looking at blob output or logs is better.
I'm revisiting this, as we continue to have problems with invalid notifications (claiming actions were taken when they actually failed) and are not confident in chaining actions (e.g. snapshot + delete). What if we simply moved failed resources to a … The biggest hurdle IMO would be to update all of the actions to return a consistent result: a list of resource ids that succeeded and a list that failed (ideally with a reason). If an uncaught exception happens, then the entire action set will stop being processed, and the whole policy execution could stop just like today.
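That consistent return shape could be as simple as a small result type. This is a sketch under my own naming, not the actual interface of any PR:

```python
from typing import Dict, List, NamedTuple

class ActionResult(NamedTuple):
    # ids that were processed successfully
    ok: List[str]
    # id -> human-readable failure reason
    failed: Dict[str, str]

def merge(a: ActionResult, b: ActionResult) -> ActionResult:
    # combine results from two batches of the same action
    return ActionResult(a.ok + b.ok, {**a.failed, **b.failed})
```

Every action returning this shape would let the policy engine decide uniformly which resources continue to the next action.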
FWIW, I'm working on a prototype ... to see if we can introduce some structure to the actions and get some consistency here. Currently going down the road of annotating actions taken with a new …
OK ... here's a more final proposal on what I've been working on after playing with it and actually getting things to work:
There are some changes in behavior that result from this:
There are several places where the set of resources being acted on is pre-filtered inside the action. For instance, stopping an ec2 instance only affects instances in the …
@kapilt @anovis @JohnHillegass Pinging a few folks to hopefully resume this discussion and get some visibility on the associated PR #5790. I'd love it if we could get some traction on having a stricter contract with actions and how they're handled. My summarized proposal above documents what the PR ended up evolving into after a lot of testing.

For some reason, I was under the impression before that actions that pre-filtered resources before acting on them would somehow affect the set of resources sent to followup actions. So the PR includes a bunch of changes related to all of the pre-filtering, to correctly categorize whether the resource was a success or failure. Those changes could be broken out into a separate PR if needed.

Regardless, would love to hear what folks think about this proposal. To summarize the post above: we simply mark each resource with a status (ok, skip, error), and that gets passed on as an annotation in the resource. Follow-on actions will by default ignore previously errored resources, but could optionally include those for actions like …
This should address points in issue cloud-custodian#3531:

- a resource should be guaranteed to only move to followup actions upon success
- some actions should allow accessing failed resources (e.g. notify)
- uncaught exceptions in actions should not affect resources that previously succeeded; they should continue to followup actions

This PR is not complete yet... but a good starting point for review.
This makes me meditate on PEP 20
What kind of follow-up actions could be taken on failed resources? Could there be scenarios where failure conditions aren't as clear without a stack trace, runtime context, etc.? What would happen for existing policies that do not have notify actions for reporting the errored resources?
I’m not sure exactly what kind of actions would actually act on failed resources other than possibly a notify. As for error handling, it would be no different than today. There’s still an error log generated. But instead of the whole policy just stopping execution, successful resources continue on.

I’m also envisioning a simple way to filter which resources take part in an action, for some actions anyway (like notify). For instance, you could have a … You could also choose to have a single notify that selects all resources and have your notification provide the details about each resource: here are the ones that failed for some reason, and here are the ones that succeeded.

This intrinsic filtering of which resources are processed by an action should, IMO, be restricted to notify and possibly invoke-lambda. Everything else would always act only on resources that have not failed. Oh, and for actions like this, it should probably record a “skip” for any resource it didn’t process. So every resource should therefore have some result for every action in the policy.
I started thinking about what kind of filtering would be most common for … Use a new …

This new …
Here's an example:

```yaml
actions:
  - snapshot
  - type: notify
    with_result: error
    action_desc: These resources failed to get snapshotted before being deleted.
  - delete
  - type: notify
    with_result: error
    action_desc: These resources had a snapshot created, but could not be deleted.
  - type: notify
    action_desc: These resources have had a snapshot created and have been deleted.
```

or alternatively:

```yaml
actions:
  - snapshot
  - delete
  - type: notify
    with_result: failure
    action_desc: These resources have failed to be snapshotted or deleted.
  - type: notify
    with_result: success
    action_desc: These resources have had a snapshot created and have been deleted.
  - type: invoke-lambda
    function: my-function
    with_result: any
```

which would send a failure notification, a success notification, then invoke a lambda with all the resources, and the lambda could look at the …
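The `with_result` selection could be modeled as a per-resource annotation plus a selector. The annotation key and status names below are illustrative choices of mine, not the PR's actual ones:

```python
RESULTS_KEY = "c7n:action-results"  # hypothetical annotation key

def record(resource, action_name, status):
    # status is "ok", "skip", or "error", recorded per action
    resource.setdefault(RESULTS_KEY, {})[action_name] = status

def select(resources, with_result="success"):
    # "any" passes everything; "error"/"failure" selects resources with
    # a recorded error; "success" selects those without one
    if with_result == "any":
        return list(resources)
    want_failed = with_result in ("error", "failure")
    out = []
    for r in resources:
        has_error = "error" in r.get(RESULTS_KEY, {}).values()
        if has_error == want_failed:
            out.append(r)
    return out
```

With this, a notify action configured with `with_result: error` would run `select(resources, "error")` and receive only the resources a prior action failed on.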
When specifying a policy on an EBS volume with actions set to `delete` then `notify`, delete failures result in the notification still firing. I'm concerned that if I wanted to ensure a snapshot is created before delete, snapshot creation failures might not keep the delete from being attempted.

@kapilt looked and said this module should get cleaned up to properly report errors.