
aws - error handling & concurrency pattern in actions/filters revisit #3531

tjstansell opened this issue Feb 15, 2019 · 30 comments

@tjstansell
Contributor

When specifying a policy on an EBS volume with actions set to delete then notify, delete failures result in the notification still firing. I'm concerned that if I wanted to ensure a snapshot is created before delete, snapshot creation failures might not keep the delete from being attempted.

@kapilt looked and said this module should get cleaned up to properly report errors

@kapilt
Collaborator

kapilt commented Feb 15, 2019

thanks for reporting, this seems to be problematic in a few resources. In a nutshell, an early concurrency and error handling pattern in the code was to use a concurrent.futures thread pool, where errors are propagated sans stack trace and, it appears, only logged. The more modern pattern is to raise from the action/filter at the end, before exit, if an unexpected exception occurred.
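
A minimal sketch of the two patterns, with a hypothetical process_resource_set standing in for the real per-chunk work (illustrative, not custodian's actual code):

from concurrent.futures import ThreadPoolExecutor, as_completed
import logging

log = logging.getLogger('custodian.example')

def process_resource_set(resource_set):
    # stand-in for the real per-chunk work, e.g. a bulk api call
    pass

def process_legacy(chunks):
    # early pattern: errors are logged inside the loop and then dropped,
    # so policy execution continues as if nothing failed
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(process_resource_set, c) for c in chunks]
        for f in as_completed(futures):
            if f.exception():
                log.error("error processing resource set: %s", f.exception())

def process_modern(chunks):
    # newer pattern: remember the error and re-raise it after the pool
    # drains, so an unexpected failure stops policy execution
    error = None
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(process_resource_set, c) for c in chunks]
        for f in as_completed(futures):
            if f.exception():
                error = f.exception()
                log.error("error processing resource set: %s", error)
    if error:
        raise error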

@kapilt kapilt changed the title EBS resource - failure in action does not prevent other actions from running aws - error handling & concurrency pattern in actions/filters review and revisit - error should stop policy exec Feb 15, 2019
@kapilt
Collaborator

kapilt commented Feb 15, 2019

@thisisshi ping, curious re your thoughts on this. I do think we're going to need some nuance, and documentation at least in developer docs (if not end user) on how custodian holistically deals with errors. The problem I'd like to address is that it's not holistic at the moment from a pattern perspective.

@tjstansell
Contributor Author

Not sure if this is something for this issue as well or a new feature request, but curious if there's an established pattern for handling failures in actions. If there was a way to mark an action as always being called (and maybe pass along the previous action's results so it could detect if there were failures), or only call it upon failure of the normal actions, then you could include some sort of error handling logic... even if that logic was just "notify someone upon failure".

@kapilt
Collaborator

kapilt commented Feb 16, 2019

there's a specific metric produced on policy execution errors, PolicyException, that can be traced back to an individual policy and set up for downstream notifies.
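
For example (a sketch; 'CloudMaid' is custodian's default metrics namespace, but verify the namespace, dimensions, and SNS topic against your own deployment):

import boto3

cw = boto3.client('cloudwatch')
cw.put_metric_alarm(
    AlarmName='custodian-delete-vol-after-snapshot-errors',
    Namespace='CloudMaid',  # custodian's default metrics namespace
    MetricName='PolicyException',
    Dimensions=[{'Name': 'Policy', 'Value': 'delete-vol-after-snapshot'}],
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
    # hypothetical topic for downstream notifies
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:custodian-alerts'],
)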

@kapilt
Collaborator

kapilt commented Feb 16, 2019

there's a lot of nuance that needs to be accounted for on error handling. i.e. say a policy is executing against ec2 and terminates 80 out of 100 instances. we'd still want to process later actions like notify on the 80; otherwise we'd be taking action on the environment and user resources without user notification. i think that was part of the origin of this behavior in some of the modules that operate against ec2.

also quoting @tjstansell's message on gitter, which raises another scenario that needs consideration as well:

The resource deletion fails, but the notification still goes out. Is there a way to define an action to perform any time an error happens and not process the remaining actions? Also thinking about the case where I want to take a backup of a dynamodb table before deleting it... how can I be sure the backup worked before the delete is attempted? And this is in a lambda-style policy, if that makes a difference.

at the moment the general error handling principles for the codebase are:

  • unexpected errors should stop policy execution
  • actions/filters on resources with embedded iam policies are free to continue in the specific case of permission denied, without raising an exception.
  • expected errors can pass silently or log per implementer choice.

the other wrinkle for error handling is that we use bulk apis wherever possible to minimize the number of api calls. those all behave somewhat differently with respect to errors. Some raise an error on a single item in the set, where we have to parse the error message to retry with the remainder of the set; some return failed items in the response (not raised as an exception), and we track those separately.
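
To illustrate both behaviors (a sketch with hypothetical helpers, not custodian's internals):

import re
from botocore.exceptions import ClientError

def tag_volumes(ec2, volume_ids, tags):
    # behavior 1: the bulk call raises on a single bad item; parse the
    # error message for the offending id and retry with the remainder
    while volume_ids:
        try:
            ec2.create_tags(Resources=volume_ids, Tags=tags)
            return
        except ClientError as e:
            if e.response['Error']['Code'] != 'InvalidVolume.NotFound':
                raise
            m = re.search(r"vol-\w+", e.response['Error']['Message'])
            if m is None:
                raise
            volume_ids = [v for v in volume_ids if v != m.group(0)]

def delete_messages(sqs, queue_url, entries):
    # behavior 2: the bulk call reports per-item failures in the
    # response instead of raising; track those separately
    resp = sqs.delete_message_batch(QueueUrl=queue_url, Entries=entries)
    return resp.get('Failed', [])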

one option is that we start tracking and filtering successful resources for each policy action to pass down, and track policy resource errors as a separate metric, with a separate resource-errors.json blob store output.
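
Hypothetically, a record in that resource-errors.json blob might look like (purely illustrative; no such output exists yet):

[
  {
    "policy": "delete-vol-after-snapshot",
    "action": "delete",
    "resource_id": "vol-0db6750be49003960",
    "error": "VolumeInUse",
    "message": "..."
  }
]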

@thisisshi
Member

Would it make sense in this case then to only pass along the successfully processed resources, in the same vein as how filters currently work? e.g. all EBS volumes that are snapshotted then get passed to the delete action. (This would definitely require a lot of work updating all the actions to properly return the correct resources, but it also solves the immediate problem.) However, this is definitely a breaking change from the current behavior. Fundamentally, I think the question is: are actions to be applied to all resources, or serially, like how we currently process filters?

Maybe the solution would be to introduce a new depends-on syntax:

policies:
  - name: delete-vol-after-snapshot
    resource: ebs
    actions:
      - type: snapshot
      - type: delete
        depends-on: snapshot # only delete resources returned by the snapshot action

Unfortunately this brings its own set of problems: action types are not unique, we would have to maintain a list of resources from each action's return in order to properly pass them along, and we would have to add a depends-on attribute to all action types.

(have some more thoughts but just noticed that @kapilt just posted, might as well put this out there for now)

@thisisshi
Member

error cases:

  • per resource action
    • fail on unexpected error
    • log on expected error
    • return successfully processed resources
  • bulk resource action
    • fail on unexpected error
    • log on expected error
    • if action returns back unsuccessful resources, return set(resources).difference(set(unsuccessful resources))
    • if the action does not return the unsuccessful resources, retry until successful and then return the successful set once the invalid resources are identified (could result in n action calls) -- this may need to be tuned for resources where a high cardinality count could cause issues (ebs snapshots, for instance)

in either case we may need to expose an additional metric wrt action failure if we're no longer applying all actions to all resources
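
A sketch of the bulk case under these rules (assuming a hypothetical bulk call that reports failed ids in its response):

def process_bulk(bulk_call, resources, id_key='VolumeId'):
    # apply the bulk action, then return only the resources that
    # succeeded: set(resources) - set(unsuccessful resources)
    resp = bulk_call([r[id_key] for r in resources])
    failed_ids = {f['Id'] for f in resp.get('Failed', [])}
    return [r for r in resources if r[id_key] not in failed_ids]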

@kapilt
Collaborator

kapilt commented Feb 18, 2019

logging on expected errors seems extraneous. the biggest concern here is the amount of churn it takes to get there. definitely don't want depends-on syntax; that's forcing too much complexity onto end users. there's a compatibility concern with reinterpreting action return values as filtered resource sets, since currently action returns are serialized as json blob store outputs. i'd want to review those, but offhand that seems okay; afaicr there's only a handful. if we wanted to preserve them, it means using an out-of-band mechanism for noting resource exceptions. there shouldn't be any automated retry at this layer for bulk resource actions; that's for the individual action to determine.

@kapilt kapilt changed the title aws - error handling & concurrency pattern in actions/filters review and revisit - error should stop policy exec aws - error handling & concurrency pattern in actions/filters revisit Feb 18, 2019
@thisisshi
Member

yea agreed, the more I thought about it the worse it got: dependency cycle checking, updating the interface, etc.

To preserve the current behavior and simplify the dependency mapping between actions, we could introduce a nested action (maybe limiting it to one level as well)? Precedent for this would be the missing filter; rather than embedding a whole custodian policy, we just embed the actions portion.

policies:
  - name: nested-actions
    resource: ebs
    actions:
      - type: snapshot
        actions:
          - type: delete
          - type: notify
            ...

@kapilt
Collaborator

kapilt commented Feb 18, 2019

i'm not sure why you're proposing new syntax, either the nesting or depends-on; what problem are you trying to address by doing that?

@thisisshi
Member

nesting would preserve the current behavior (applying all actions to all resources). are you proposing that the default behavior be changed to passing the return set of each action into the next action?

@kapilt
Collaborator

kapilt commented Feb 19, 2019

i'm saying the default behavior you're referencing compatibility to is actually an implementation detail, not an interface guarantee. the common expectation has always been that things would stop on error, so there isn't a compatibility concern here. the behavior change is either to continue with only the non-error'd items through the rest of the list before erroring out on policy execution, OR to have the action error out then and there to stop policy execution. introducing new syntax feels like complexity in an attempt to preserve unspecified semantics.

@thisisshi
Member

Ah, gotcha sounds good to me. Thanks for the clarification 👍

@tjstansell
Contributor Author

Perhaps we should let each policy define its own error handling behavior. It seems that each policy has its own dependencies in that regard. If I have a policy like I've suggested, with actions of:

  resource: ebs
  actions:
  - type: snapshot
  - type: delete
  - type: notify

What I'd like is to have the ability for the notify action to always fire and be able to report on all resources, but show which actions were successful and which were failures. I would also assume that only resources that were successfully snapshotted would get deleted. So maybe the resource list passed to each action should only include the successfully processed resources from the previous action, but the return status from all previous actions and all previous resources could be available for the action to reference?

Or maybe it could support some sort of on_error setting where you can define a single(?) action to perform on any resources that failed at a particular action (where successful resources would keep progressing through the list of actions):

  resource: ebs
  actions:
  - type: snapshot
    on_error:
      type: notify
      msg: These resources failed to snapshot.
  - type: delete
    on_error:
      type: notify
      msg: These resources were snapshot, but failed to delete.
  - type: notify
    msg: These resources were deleted after being snapshotted.

I was also brainstorming the idea of supporting something like a try, except, finally syntax to define actions:

  resource: ebs
  actions:
    - try:
      - type: snapshot
      - type: delete
    - finally:
      - type: notify

but I'm not sure there's really a case for an except block ... I do like the explicit nature of how that shows the notify action will always be attempted regardless of how many try actions worked. Again, I think the interesting part here is what the payload looks like that's being passed to the finally action. Somehow, it would be great if the status of all of the resources' actions could be referenced, but I honestly don't know how complicated that might get ... both to track and pass down, as well as for each action to parse.

I guess in my head the resources listed for each action would represent which resources that action should take action on. This would be the set that have passed all previous actions. There could also be a results section somewhere that would show the results of all previous actions.

Anyway, that's a long ramble, but as a consumer of this I wanted to throw my two cents in.

Thanks!

@tjstansell
Contributor Author

tjstansell commented Mar 22, 2019

It seems there should be a good and consistent way to prune non-existent resources from the list as well ... I've run into situations where an EBS volume will get deleted between the time the resources are queried and when custodian goes to tag the resources. Unfortunately, this results in the entire tag effort failing for all resources. It would be ideal if it could detect that these resources no longer exist and remove them from the set of resources it's working on. The same goes for having multiple policies that might act on resources at the same time... as changes are made to resources, policies that run later end up using the cached resource data instead, which doesn't reflect the current state of the resource (tag changes, whether an instance has been deleted, etc).

This is similar to what is reported in #2892 with the boto error being:

An error occurred (InvalidVolume.NotFound) when calling the CreateTags operation: The volume 'vol-0db6750be49003960' does not exist.
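
One way to prune would be to re-check existence just before the bulk call; a describe using a volume-id filter (unlike the VolumeIds parameter) doesn't raise on ids that no longer exist. A sketch, not custodian's actual code:

def prune_missing_volumes(ec2, resources):
    # re-describe the volumes by filter and drop any that were
    # deleted since the original query
    existing = set()
    paginator = ec2.get_paginator('describe_volumes')
    pages = paginator.paginate(Filters=[{
        'Name': 'volume-id',
        'Values': [r['VolumeId'] for r in resources]}])
    for page in pages:
        existing.update(v['VolumeId'] for v in page['Volumes'])
    return [r for r in resources if r['VolumeId'] in existing]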

@anovis
Contributor

anovis commented Mar 23, 2019

i was also thinking about adding functionality around exception handling. For example, the max-resources configuration raises an exception when triggered, so the policy errors out, but i might want to trigger a notify action instead. so, similar to @tjstansell:

 max-resources: 50
 resource: ebs
 actions:
  - type: delete
 on_error:
  - exception: ResourceLimitExceeded
    type: notify

@kapilt
Collaborator

kapilt commented Apr 2, 2019

@anovis this issue is about something unrelated imo. if you want notifies of policy errors, they are already published as a metric, and alerts can be set on the metrics for notifies.

@kapilt
Collaborator

kapilt commented Apr 2, 2019

@tjstansell we really, really don't want to introduce programming-in-yaml constructs. It also doesn't map well to any of the implementation details of the issue at hand, which is handling failed resources. The proposal here is something different that gets to the same end goal: showing which resources failed at which stage when dealing with partial success, via a separate json file for them in blob storage, and it does that without introducing new configuration syntax.

@tjstansell
Contributor Author

I'm not sure I understand the proposal you're talking about that handles the situations I've mentioned. Reading through the comments here again, everything feels like a brainstorming session (including my own suggestions). If there's a proposal that lets me define how to handle errors, great. It's not clear to me yet, though, what that proposal is.

@kapilt
Collaborator

kapilt commented Apr 2, 2019

"one option is that we start tracking and filtering successful resources for each policy action to pass down, and track policy resource errors as a separate metric, with a separate resource-errors.json blob store output."

in a nutshell, it's not safe to do anything with failed resources; we don't know why they failed. so we remove them from the set, store them, and process the remainder.

adding a bunch of configuration here is problematic imo; it's just adding complexity that the user doesn't want, or in many cases doesn't even know what to do with.

@tjstansell
Contributor Author

What about the case where you have multiple actions ... there should be a way to at least notify folks of the actions that successfully happened to resources, even if not all actions were successful. Are you proposing that we need some way to monitor the metrics to detect when something failed, then somehow determine which resources failed, which relies on digging through cloudwatch logs from lambdas? ... and create a notification from that? I'm not even sure how someone would do that.

I mean, moving to a place where we know a resource will move on to the next step only if it succeeds is great... (which is basically how I think it's supposed to be working today) but I think many of us are hoping to have more control than that. I was really hoping that at the end of each action there was a way to trigger a specific action (probably just notify) for anything that failed. Or, at the end of the entire policy run, allow a way to call notify and include a payload where we can determine which step each resource successfully completed.

I'm guessing that most of us just want a rich notification that accurately represents what happened rather than wanting to try to do any other specific actions upon failure.

Perhaps chaining policies is one way to achieve this, but the current limitations of how caching works would prevent you from doing both actions in a single run. For instance: a policy to take a backup of a dynamodb table, then a second policy that deletes that table, but filtered only on tables that have had a backup within X hours (not that that filter exists yet, afaik). You'd have to run custodian with just the first policy, then again with the second, to keep the original cached resource state from being used in the second policy.

@anovis
Contributor

anovis commented Apr 3, 2019

cloudwatch metrics for exceptions just track the number of occurrences and provide no additional details, so you would always be forced to go to the logs, even for exceptions you don't want to worry about. being able to notify and pass the exception along to sns or sqs would enable an easier workflow. for example, you could send the exceptions to an sqs queue and parse them to route one type of exception to one team and other exceptions to another.

another thing to consider is that as we move towards more multi-cloud support, we wouldn't want users to need three different solutions for monitoring and creating custom notify actions from the exception metrics, depending on whether they output to cloudwatch or the gcp/azure equivalents.

self.metrics.put_metric('PolicyException', 1, "Count")

@kapilt
Collaborator

kapilt commented Apr 3, 2019

@tjstansell blob storage (aka -s) is the system of record for a custodian run/execution and contains logs, metrics, resources, metadata, etc. i'm proposing we extend that with an additional blob for resources that failed any filter/action, along with metadata about that action/error. if you see a policy error metric, you go check the resource error blob and execution logs for that execution, which are sitting in blob storage (or alternatively index them directly for alerts).

@anovis re multi-provider the intent is that we'll allow them to mix and match on outputs from providers.

imo, adding five to ten lines of config per policy as boilerplate is not a good user experience. the reality is that on error you're always going to want to look at the logs and outputs for more context.

maybe we'll get to notify, but frankly, without the first step, which is a large undertaking (i.e. actually recording what error'd, where, and why), it's moot imo, since you're not notifying on anything relating to which resources were actually processed or error'd, or at what step; at which point alerting on the metrics and then looking at blob output or logs is better.

@tjstansell
Contributor Author

nutshell its not safe to do anything with failed resources, we don't know why they failed. so we remove them from the set, store them, and process the remainder.

I'm revisiting this, as we continue to have problems with invalid notifications (claiming actions were taken when they actually failed) or are not confident on chaining actions (e.g. snapshot + delete).

What if we simply moved failed resources to a failed set in the payload we send to the next action? The actions could stay the same as today, where they act only on the resources set, which means actions continue as long as they succeed for a resource. Any resources that fail will then stop being processed, but succeeding ones will continue. However, some actions (e.g. notify) could know how to also look at the failed set of resources so proper reporting could still happen. In theory you could then chain in some invoke-lambda actions that specifically look for failed resources at each step, if needed. No additional config syntax... each step simply knows which set of resources to look at.

The biggest hurdle IMO would be to update all of the actions to return a consistent result ... a list of resource ids that succeeded and a list that failed (ideally with a reason)... if an uncaught exception happens, then the entire action set will stop being processed and the whole policy execution could stop just like today.
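
A minimal sketch of what such a uniform return contract might look like (hypothetical, not an existing custodian interface):

from dataclasses import dataclass, field

@dataclass
class ActionResult:
    succeeded: list = field(default_factory=list)  # resource ids that worked
    failed: dict = field(default_factory=dict)     # resource id -> failure reason

# e.g. a delete action might return:
# ActionResult(succeeded=['vol-1'], failed={'vol-2': 'VolumeInUse'})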

@tjstansell
Contributor Author

FWIW, I'm working on a prototype ... to see if we can introduce some structure to the actions and get some consistency here. Currently going down the road of annotating actions taken with a new c7n:Actions key added to the resources, and wrapping the initial call to process() from the policy execution so it automatically filters which resources previously failed and which succeeded ... will link a PR for review once I get something that seems potentially reasonable.

@tjstansell
Contributor Author

OK ... here's a more final proposal on what I've been working on after playing with it and actually getting things to work:

  • Each action is passed the full set of resources after filters have been applied
  • Each resource records the action state as it's processed during an action. These states can be ok, skip, or error.
  • Any resources that did not record a state by the end of the action will get an ok status applied unless an exception is caught, in which case an error state will be applied.
  • these action states are recorded in a new c7n:Actions attribute in the resource, which is a list. There should be one result recorded for every action in the policy.
  • actions by default will filter out any previously failed resources, so they only act on those with no failures.
  • some actions can opt-in to including failed resources, which means they simply don't filter them out. This is useful for the external things like invoke-function, notify, etc. that might want to report on or act on resources that failed an action.

There are some changes in behavior that result from this:

  • exceptions during an action will not cause the entire policy to stop execution. Instead, each resource is marked as failed and passed on to followup actions. This is potentially useful for notify actions to be able to send notices about these failures instead of the execution only logging an error.
  • some tests will need to be updated, as execution errors no longer cause an exception at the policy level, but instead result in all resources being marked as failed.
  • actions that pre-filter resources previously had the side-effect of removing those resources from the resource list being acted on ... which removed them from further actions. we will need to make sure to mark these as error to keep them from being included in further actions.

There are several places where the set of resources being acted on is pre-filtered inside the action. For instance, stopping an ec2 instance only affects instances in the running state. Instead of just filtering, we now need to look at each of these and determine if something should be considered a failure. For instance, if the instance is already in the stopped state, we'd mark the resource as skip because it's already in the desired state. If it is not stoppable, rather than skipping it we should now mark it as an error. Basically, if the resource is not in the desired final state when we're done, it should be marked as an error.
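
For illustration, a resource coming out of such a run might carry an annotation shaped roughly like this (field names are hypothetical, based on the description above):

resource = {
    'VolumeId': 'vol-0db6750be49003960',
    # one entry per action in the policy, in execution order
    'c7n:Actions': [
        {'action': 'snapshot', 'status': 'ok'},
        {'action': 'delete', 'status': 'error', 'reason': 'VolumeInUse'},
        {'action': 'notify', 'status': 'skip'},
    ],
}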

@tjstansell
Contributor Author

@kapilt @anovis @JohnHillegass

Pinging a few folks to hopefully resume this discussion and get some visibility to the associated PR #5790. I'd love it if we could get some traction on having a more strict contract with actions and how they're handled.

My summarized proposal above documents what the PR ended up evolving to become after doing a lot of testing. For some reason, I was under the impression before that actions that pre-filtered resources before acting on them would somehow affect the set of resources sent to followup actions. So, the PR includes a bunch of changes related to all of the pre-filtering to correctly categorize whether the resource was a success or failure. Those changes could be broken out into a separate PR, if needed.

Regardless, would love to hear what folks think about this proposal. To summarize the post above, we simply mark each resource with a status (ok, skip, error) and that gets passed on as an annotation in the resource. Follow on actions will by default ignore previously errored resources, but could optionally include those for actions like notify.

tjstansell added a commit to tjstansell/cloud-custodian that referenced this issue Jun 8, 2020
This should address points in issue cloud-custodian#3531:
  - a resource should be guaranteed to only move to followup actions upon success
  - some actions should allow accessing failed resources (e.g. notify)
  - uncaught exceptions in actions should not affect resources that
    previously succeeded.  they should continue to followup actions.

This PR is not complete yet... but a good starting point for review.
@JohnHillegass
Collaborator

JohnHillegass commented Jun 10, 2020

This makes me meditate on PEP 20

Errors should never pass silently

What kind of follow-up actions could be taken on failed resources? Could there be scenarios where failure conditions aren't as clear without a stack trace, runtime context, etc.? What would happen for existing policies that do not have notify actions for reporting the errored resources?

@tjstansell
Contributor Author

tjstansell commented Jun 10, 2020

I’m not sure exactly what kind of actions would actually act on failed resources other than possibly a notify action. But conceivably someone might want to run invoke-lambda on them and do something special.

As for error handling, it would be no different than today. There’s still an error log generated. But instead of the whole policy just stopping execution, successful resources continue on.

I’m also envisioning a simple way to filter which resources take part in an action, for some actions anyway (like notify). For instance, you could have a snapshot action, then a notify, then delete, then notify. The first notify could select only failed resources so you’re notifying on error and the notification would know the snapshot failed for some reason. Similarly, the second notify could be left to defaults and only notify about successful resources (the same behavior as today really). You could also have a third notify that selects only failed resources but only those that failed the delete action. It too could have specific wording so the folks knew a snapshot was created but the deletion failed.

You could also choose to have a single notify that selects all resources and have your notification provide the details about each resource... here are the ones that failed for some reason and here are the ones that succeeded.

This intrinsic filtering of which resources are processed by an action should, IMO, be restricted to notify and possibly invoke-lambda. Everything else would always act only on resources that have not failed. Oh, and for actions like this, it should probably record a “skip” for any resource it didn’t process. So every resource should therefore have some result for every action in the policy.

@tjstansell
Contributor Author

I started thinking about what kind of filtering would be most common for notify actions ... or any other, really ... and came up with this proposal:

Use a new with_result parameter to determine which resources to include in the action with these possible values:

  • ok, skip, error -- look for this specific result from the last action
  • success -- resources that have not had any errors from any previous actions (this would be the default)
  • failure -- resources that have had errors in any previous actions
  • any -- all resources are included ... no filtering

This new with_result parameter would only be available to the following actions:

  • notify
  • invoke-lambda
  • webhook

Here's an example:

actions:
  - snapshot
  - type: notify
    with_result: error
    action_desc: These resources failed to get snapshotted before being deleted.
  - delete
  - type: notify
    with_result: error
    action_desc: These resources had a snapshot created, but could not be deleted.
  - type: notify
    action_desc: These resources have had a snapshot created and have been deleted.

or alternatively:

actions:
  - snapshot
  - delete
  - type: notify
    with_result: failure
    action_desc: These resources have failed to be snapshotted or deleted.
  - type: notify
    with_result: success
    action_desc: These resources have had a snapshot created and have been deleted.
  - type: invoke-lambda
    function: my-function
    with_result: any

which would send a failure notification and a success notification, then invoke a lambda with all the resources; the lambda could look at the c7n:Actions annotation to see whether each resource failed or succeeded and do something else with that.
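
Internally, the with_result selection could be as simple as this (a hypothetical helper, assuming the c7n:Actions annotation described above):

def filter_with_result(resources, with_result='success'):
    def statuses(r):
        return [a['status'] for a in r.get('c7n:Actions', [])]

    if with_result == 'any':
        return resources
    if with_result == 'success':
        # no errors recorded by any previous action
        return [r for r in resources if 'error' not in statuses(r)]
    if with_result == 'failure':
        # at least one previous action recorded an error
        return [r for r in resources if 'error' in statuses(r)]
    # ok / skip / error: match the result of the immediately prior action
    return [r for r in resources
            if statuses(r) and statuses(r)[-1] == with_result]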
