
Semantics for destructive operations design proposal #1148

Merged: 2 commits into aws-controllers-k8s:main on Jan 9, 2023

Conversation

RedbackThomson
Contributor

Issue #, if available: #82

Description of changes:
Draft proposal for mechanisms to protect users against accidental deletion of ACK and/or AWS resources

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Collaborator

@jaypipes jaypipes left a comment


I am opposed to the entire "deletion protection" feature, for reasons stated inline in my comments.

I would be fine with a "unadopt resource" or "unmanage resource" explicit action and user experience, however.

Please rename this proposal to "Deletion protection", because I'd like to see a separate proposal on cascading delete functionality (which I consider to be more aligned with the declarative configuration management principles that ACK should be encouraging).


* Helm uses an annotation: `helm.sh/resource-policy: keep`
* Google Config Connector uses an annotation: `cnrm.cloud.google.com/deletion-policy: abandon`
* Skaffold uses a label: `skaffold.dev/cleanup: "false"`

Collaborator

There were also proposals originally to use an annotation/label for our resource adoption functionality. Instead of doing the annotation/label on a CR, we chose a user experience where the user needs to explicitly adopt an AWS resource by creating a services.k8s.aws/AdoptedResource CR. I continue to believe that we made the right decision in this regard: an explicit adoption action (via the kubectl create -f <services.k8s.aws/AdoptedResource> process) is the least surprising, hardest to misuse interface.

For the reverse of adoption action, I would prefer to have a similarly explicit action, essentially an "unadopt" or "unmanage" action. We could do this in a couple ways:

  • Create a new kubectl plugin/verb called kubectl unmanage that the user could call to delete the CR but keep the AWS resource around
  • kubectl delete services.k8s.aws/AdoptedResource/<resource arn> and have the AdoptedResourceReconciler run some "delete the CR but keep the AWS resource around" code


#### Examples

Definition at the controller - all resources reconciled by this controller will use the “retain” policy (unless overridden by an annotation):
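The manifest for this example is not reproduced in this rendering of the thread. A minimal sketch of what a controller-level "retain" policy could look like, assuming a standard ACK controller Deployment (the image, namespace, and surrounding fields here are illustrative assumptions, not the merged implementation):

```yaml
# Illustrative sketch only: field values are assumptions, not the merged code.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ack-s3-controller
  namespace: ack-system
spec:
  selector:
    matchLabels:
      app: ack-s3-controller
  template:
    metadata:
      labels:
        app: ack-s3-controller
    spec:
      containers:
        - name: controller
          image: public.ecr.aws/aws-controllers-k8s/s3-controller:latest
          args:
            - --aws-region
            - us-west-2
            - --deletion-policy
            - "retain"  # every resource this controller reconciles defaults to retain
```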

Member

I do not think "retain" should be the default use-case. We can keep the current implementation as the default, i.e. deleting a resource cleans things up on the AWS side as well. In my opinion this will be the default expectation of most users.

An option to override this CLI option for production use cases, stateful resources and databases will give users the ability to prevent accidental deletion.

Another, less popular, option would be to get opinionated about AWS service controllers and store metadata about which service controllers have "retain" as the default. Then Helm/Kustomize artifacts would only set the deletion policy to "retain" for those service controllers. However, this introduces inconsistency among service controllers by default.


Contributor Author

> I do not think "retain" should be the default use-case

"delete" is the default. All default ACK functionality will remain the same for all considerations of this proposal. Only in this particular example do we override the default and set it to be "retain" at the command line argument level.

> And option to override this cli-option for production use, stateful resources and databases will give users ability to prevent accidental deletion.

That is what this example is attempting to show.

@RedbackThomson
Contributor Author

RedbackThomson commented Jan 28, 2022

@jaypipes

> I am opposed to the entire "deletion protection" feature, for reasons stated inline in my comments.

I actually do agree with this. I wanted to write this proposal to get critical feedback like this. As I was writing the proposal I felt that we would be overengineering functionality. However, I still felt it was important to explore the technical challenges involved just so that we could all see the pros and cons of the function.

I will be removing the "Deletion protection" from this proposal - I might add an appendix where I mention this discussion to keep it for posterity. I will pivot this proposal into discussing only the "AWS Removal Policy" functionality.

On that front:

> For the reverse of adoption action, I would prefer to have a similarly explicit action, essentially an "unadopt" or "unmanage" action

The reason we chose to have a separate CRD for adoption was mainly because we could not fit enough information into the existing CRD (or metadata) of the ACK resources, or we would have had to significantly redesign the optional/required fields. Further, the adopted resource CR acts as a marker for, and stores the state for, the adoption reconciler to create a new ACK resource which the controller then reconciles as if it were previously created.

I don't believe the removal policy needs anything as sophisticated as this. This proposal only addresses whether the controller will or will not call Delete* in the SDK before it deletes the finalizer. We do not need to modify any elements of the spec or metadata (apart from this single annotation) on the ACK resource. The ACK resource is also immediately deleted after the delete operation, so having a remaining "Unadoption" custom resource would essentially only act as a log.

Your kubectl unadopt alias could easily be made, and would be as simple as:

```shell
kubectl annotate bucket my-bucket services.k8s.aws/deletion-policy=retain
kubectl delete bucket my-bucket
```

Member

@vijtrip2 vijtrip2 left a comment

I like this proposal. Great job!
I like how you broke the problem into two main parts, and referenced similar issues from other communities too.

I just left some clarifying questions and small comments.

Collaborator

@jaypipes jaypipes left a comment

I'm still not in favour of any of this feature as described in this proposal. The only thing I would be in favour of is something equivalent to an explicit kubectl unadopt command.

The absolute last UI we should be emulating is CloudFormation, IMHO.


When deleting an ACK custom resource, the underlying AWS resource is first deleted. Users should expect this to be the default behaviour, as creating and updating a CR similarly trigger the corresponding actions in AWS. However, users have requested that some resources should continue to live on after the ACK CR has been deleted. Resources with stateful data, such as S3 buckets, RDS instances or EBS volumes, may need to be retained so that they can be migrated between ACK installations, or off ACK entirely, without requiring that they be backed up, deleted and then restored.

Similar to the CDK [`RemovalPolicy` type](https://docs.aws.amazon.com/cdk/api/v2//docs/aws-cdk-lib.RemovalPolicy.html#members) and the native [K8s PV reclaim policy](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming), ACK should support the ability to leave the AWS resource alive while still deleting the custom resource object.

Collaborator

I'll be honest here... any time I see a reference to CloudFormation's (this is CDK, yes, but it's all CloudFormation...) DeletionPolicy, I shudder. I really despise the entire user experience of configuring some tool to behave in different ways when the user takes a specific action (the action in this case is the DELETE action). This is a non-obvious interface and I am opposed to it. Instead, we should encourage a user experience that requires the user to explicitly call a different action in order to have the tool behave differently.

Collaborator

After chatting with @RedbackThomson a bit more about this, I'm going to take a slightly less rigid approach. The Kubernetes API unfortunately does not have a "control channel" through which to send instructions to a controller about how to handle a particular action. Instead, we have the metadata.annotations Hammer, which is unfortunate IMHO because the annotation lives on the resource and not where it should be: as decorative data on the request.

Alas, the metadata.annotations Hammer is all we have so I will reconsider this proposal in light of that and see if I can come up with a variation that doesn't offend my sense of poor coupling.

@ack-bot
Collaborator

ack-bot commented May 17, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale

@ack-bot ack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2022
@RedbackThomson
Contributor Author

/remove-lifecycle stale

@ack-bot ack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@aws-controllers-k8s aws-controllers-k8s deleted a comment from ack-bot May 18, 2022
@bwhaley

bwhaley commented Jul 13, 2022

Having read the proposal and the comments, I can see that there are some strong opinions and fierce opposition to this feature, so perhaps the point is moot. Nevertheless, I'd like to weigh in for some additional perspective since my team is starting to evaluate ACK and this was one area of concern.

Many teams, ours included, use k8s only for stateless applications. Losing a deployment/service/configmap due to an erroneous merge is easy to rollback and no permanent damage is done. Losing an RDS instance or S3 bucket, OTOH, could be a disaster from which there is no recovery. For that reason, some kind of protection would be welcomed.

Additionally, when using managed Kubernetes services like EKS in which the customer has no control plane access, there's a certain nervous tic that something could go wrong in etcd completely outside of customer control. What if a weird bug in EKS caused objects in etcd to disappear and, as a result, the underlying AWS resources were deleted? Some of the protections mentioned in the proposal might protect against this eventuality (e.g. a policy on the controller). Perhaps this is highly unlikely/nigh impossible, but is it fair to write off this worry entirely? We've already experienced unexplainable behavior with etcd on EKS (not this specific thing) so I think the concern is well founded.

"Unadopting" resources is only useful when removal is planned. It's the unplanned errors that I'd worry about.

I do hope this proposal will remain open for others to offer still more perspectives.

@RedbackThomson
Contributor Author

RedbackThomson commented Jul 14, 2022

> What if a weird bug in EKS caused objects in etcd to disappear and as a result the underlying AWS resources are deleted

Although unlikely, I understand your point. The controller is designed to delete resources when it sees the deletionTimestamp property is set to a non-nil value. Disappearing from etcd is less of a concern, but some other tool setting this value on behalf of an unwilling user is a fair problem.

I do still agree with @jaypipes with respect to not having deletion protection. We would essentially be recreating RBAC in a very naive way, introducing yet another mechanism for people to configure permissions on top of K8s RBAC and AWS IAM. Those two systems are far better suited for that style of fine-grained access control. If you really, really wanted to make sure your production systems can't be deleted by an ACK controller, you could even modify the IAM role associated with the controller and deny all Delete:* permissions.

However, the proposal for adding a retain setting on resources I think still stands. There may be situations where you'd like to migrate stateful applications between ACK-enabled clusters. Setting retain on each of your stateful application infrastructure elements, deleting them from the cluster and then adopting them into the new cluster would greatly reduce the chance of accidental deletion. Furthermore, proactively setting retain on stateful resources would protect those resources from being accidentally deleted by someone who doesn't realise they're in the wrong cluster context.
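The migration flow sketched here could look like the following (the resource names are illustrative, and the `AdoptedResource` fields are assumed from the existing ACK adoption feature rather than specified by this proposal):

```yaml
# Old cluster: mark the bucket "retain", then delete the CR; the AWS bucket survives.
#   kubectl annotate bucket my-data services.k8s.aws/deletion-policy=retain
#   kubectl delete bucket my-data
#
# New cluster: explicitly re-adopt the surviving AWS resource.
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-my-data
spec:
  aws:
    nameOrID: my-data   # the existing S3 bucket, left intact by "retain"
  kubernetes:
    group: s3.services.k8s.aws
    kind: Bucket
```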

This proposal will remain open until either someone decides to act on it - doesn't have to be the ACK core team, this feature is open for anyone to take a crack at - or until it is dismissed for one reason or another.

@bwhaley

bwhaley commented Jul 14, 2022

> We would essentially be recreating RBAC in a very naive way, introducing yet another mechanism for people to configure permissions on top of K8s RBAC and AWS IAM.

AFAICT this proposal or any resource protection feature doesn't resemble RBAC in the least. There isn't anything about access control or permissions. Rather, this is about how to decide whether a resource should be deleted.

> However, the proposal for adding a retain setting on resources I think still stands

Interesting. So you're suggesting something like (though not exactly) the following:

`s3.services.k8s.aws/deletion-policy: retain`

for any individual resource, but not for an entire controller or namespace? Am I understanding your point correctly?

If so, that's great, but that smells like a deletion policy to me.

@RedbackThomson
Contributor Author

> AFAICT this proposal or any resource protection feature doesn't resemble RBAC in the least. There isn't anything about access control or permissions. Rather, this is about how to decide whether a resource should be deleted.

Oh, actually, I removed that from the proposal altogether. That's what the majority of the disagreement was about - designing an annotation that prevents deletion of a given resource. Nevermind, then.

> for any individual resource, but not for an entire controller or namespace

Correct. This is the current intention for this version of the proposal.

> smells like a deletion policy to me

Well it was borrowed from the CloudFormation DeletionPolicy resource property, so that's not a coincidence.

@bwhaley

bwhaley commented Jul 14, 2022

Got it! I think I was missing all the RBAC context. Now it makes more sense.

> for any individual resource, but not for an entire controller or namespace
>
> Correct. This is the current intention for this version of the proposal.

Nice, seems really clean then.

> Well it was borrowed from the CloudFormation DeletionPolicy resource property, so that's not a coincidence.

FWIW, Terraform uses the term "prevent destroy" which also seems apropos to this proposal.

@eks-bot

eks-bot commented Oct 12, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale

@ack-bot ack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2022
@a-hilaly
Member

/lifecycle frozen

@ack-bot
Collaborator

ack-bot commented Oct 27, 2022

@a-hilaly: The lifecycle/frozen label cannot be applied to Pull Requests.

In response to this:

> /lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@a-hilaly
Member

/remove-lifecycle stale

@ack-bot ack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2022

Member

@a-hilaly a-hilaly left a comment

Very clear 👍 can't wait to see this in action

Comment on lines +86 to +87
```diff
+ - --deletion-policy
+ - "retain"
```
Member

For the drift remediation feature, we went for resource-level configuration. On top of the --deletion-policy flag, do we want to add another flag to provide a similar experience for deletion protection? Thinking:

```shell
--resource-deletion-policy Function=Retain \
--resource-deletion-policy EventSourceMappings=Retain
```

Contributor Author

I think I actually have some issues with the resource level configuration of drift remediation, in that it's a non-standard format and seems a little bit convoluted. But I'll leave those comments on your PR.

For now I'll stick to having a global deletion policy flag, and leave the door open to more granular CLI flags later.

```yaml
  name: retained-bucket
```

## Backwards Compatibility and Risks

Member

This might be out of scope for this proposal: currently, resources are deleted right after a DELETE API call is successful. Some resources still hang around for multiple minutes, or even days, e.g. RDS::DBInstance, CloudTrail::EDS... A few open questions:

  • Do we still want to continue having the same user experience in the future?
  • If a user sets the deletion policy to Delete, do we want to keep the CR alive, reflecting the real state of the resource?

Contributor Author

> If a user sets the deletion policy to Delete, do we want to keep the CR alive reflecting the real state of a resource?

A deletion policy of Delete should be identical to the current implementation. A deletion policy of Retain will immediately remove the CR from K8s (since it takes no synchronous action on the underlying AWS resource).

@jljaco
Member

jljaco commented Jan 9, 2023

/lgtm

@ack-bot ack-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2023
@ack-bot
Collaborator

ack-bot commented Jan 9, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jljaco, RedbackThomson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [RedbackThomson,jljaco]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-bot ack-bot merged commit e548b73 into aws-controllers-k8s:main Jan 9, 2023
@RedbackThomson RedbackThomson deleted the delete-proposal branch January 9, 2023 23:50
ack-bot pushed a commit to aws-controllers-k8s/runtime that referenced this pull request Jan 10, 2023
Issue #, if available: aws-controllers-k8s/community#82

Description of changes:
As per the proposal: aws-controllers-k8s/community#1148

Adds support for a new optional CLI argument for the controller binary, `--deletion-policy`, supporting `delete` and `retain` as available options. Setting this argument to `delete` keeps the controller's current default behaviour, whereby it deletes the AWS resource under management before the CR is removed from the K8s API server. Setting this argument to `retain` modifies the default behaviour, leaving the AWS resource intact (taking no action on it) before removing the CR from the K8s API server.

Adds support for a new annotation on `Namespace` objects: `{service}.services.k8s.aws/deletion-policy: delete/retain` (where `{service}` is the service alias of the controller, e.g. `s3`). This annotation overrides the deletion policy behaviour configured for the controller, for all resources deployed under the namespace.

Adds support for a new annotation on any ACK resource object: `services.k8s.aws/deletion-policy: delete/retain`. This annotation overrides only the deletion policy behaviour configured for the resource.
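Taken together, the levels described above layer from least to most specific: controller flag, then namespace annotation, then resource annotation. A sketch of the two annotation levels, assuming the s3 controller (the resource names are illustrative):

```yaml
# Namespace-level: every s3 resource in this namespace defaults to "retain".
apiVersion: v1
kind: Namespace
metadata:
  name: production
  annotations:
    s3.services.k8s.aws/deletion-policy: retain
---
# Resource-level: this one bucket opts back in to normal deletion.
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: scratch-bucket
  namespace: production
  annotations:
    services.k8s.aws/deletion-policy: delete
spec:
  name: scratch-bucket
```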

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.