
Semantics for destructive operations design proposal #1148

Merged: 2 commits into aws-controllers-k8s:main on Jan 9, 2023

Conversation

RedbackThomson
Contributor

Issue #, if available: #82

Description of changes:
Draft proposal for mechanisms to protect users against accidental deletion of ACK and/or AWS resources

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Collaborator

@jaypipes jaypipes left a comment


I am opposed to the entire "deletion protection" feature, for reasons stated inline in my comments.

I would be fine with a "unadopt resource" or "unmanage resource" explicit action and user experience, however.

Please rename this proposal to "Deletion protection", because I'd like to see a separate proposal on cascading delete functionality (which I consider to be more aligned with the declarative configuration management principles that ACK should be encouraging).


* Helm uses an annotation: `helm.sh/resource-policy: keep`
* Google Config Connector uses an annotation: `cnrm.cloud.google.com/deletion-policy: abandon`
* Skaffold uses a label: `skaffold.dev/cleanup: "false"`

Collaborator

There were also proposals originally to use an annotation/label for our resource adoption functionality. Instead of doing the annotation/label on a CR, we chose a user experience where the user needs to explicitly adopt an AWS resource by creating a services.k8s.aws/AdoptedResource CR. I continue to believe that we made the right decision in this regard: an explicit adoption action (via the kubectl create -f <services.k8s.aws/AdoptedResource> process) is the least surprising, hardest to misuse interface.

For the reverse of adoption action, I would prefer to have a similarly explicit action, essentially an "unadopt" or "unmanage" action. We could do this in a couple ways:

  • Create a new kubectl plugin/verb called kubectl unmanage that the user could call to delete the CR but keep the AWS resource around
  • kubectl delete services.k8s.aws/AdoptedResource/<resource arn> and have the AdoptedResourceReconciler run some "delete the CR but keep the AWS resource around" code


#### Examples

Definition at the controller - all resources reconciled by this controller will use the “retain” policy (unless overridden by an annotation):
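The manifest for this example is not reproduced in this rendering of the thread. A minimal sketch of what a controller-level "retain" policy could look like, assuming a standard ACK controller Deployment (the image, namespace, and surrounding fields here are illustrative assumptions, not the merged implementation):

```yaml
# Illustrative sketch only: field values are assumptions, not the merged code.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ack-s3-controller
  namespace: ack-system
spec:
  selector:
    matchLabels:
      app: ack-s3-controller
  template:
    metadata:
      labels:
        app: ack-s3-controller
    spec:
      containers:
        - name: controller
          image: public.ecr.aws/aws-controllers-k8s/s3-controller:latest
          args:
            - --aws-region
            - us-west-2
            - --deletion-policy
            - "retain"  # every resource this controller reconciles defaults to retain
```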

Member

I do not think "retain" should be the default use-case. We can keep the current implementation as the default, i.e. deleting a resource cleans things up on the AWS side as well. In my opinion this will be the default expectation of most users.

An option to override this CLI option for production use cases, stateful resources and databases will give users the ability to prevent accidental deletion.

Another, less popular, option would be to get opinionated about AWS service controllers and store metadata about which service controllers have "retain" as the default. Then Helm/Kustomize artifacts would only set the deletion policy to "retain" for those service controllers. However, this introduces inconsistency among service controllers by default.


Contributor Author

> I do not think "retain" should be the default use-case

"delete" is the default. All default ACK functionality will remain the same for all considerations of this proposal. Only in this particular example do we override the default and set it to be "retain" at the command line argument level.

> And option to override this cli-option for production use, stateful resources and databases will give users ability to prevent accidental deletion.

That is what this example is attempting to show.

@RedbackThomson
Contributor Author

RedbackThomson commented Jan 28, 2022

@jaypipes

> I am opposed to the entire "deletion protection" feature, for reasons stated inline in my comments.

I actually do agree with this. I wanted to write this proposal to get critical feedback like this. As I was writing the proposal I felt that we would be overengineering functionality. However, I still felt it was important to explore the technical challenges involved just so that we could all see the pros and cons of the function.

I will be removing the "Deletion protection" from this proposal - I might add an appendix where I mention this discussion to keep it for posterity. I will pivot this proposal into discussing only the "AWS Removal Policy" functionality.

On that front:

> For the reverse of adoption action, I would prefer to have a similarly explicit action, essentially an "unadopt" or "unmanage" action

The reason we chose to have a separate CRD for adoption was mainly because we could not fit enough information into the existing CRD (or metadata) of the ACK resources, or we would have had to significantly redesign the optional/required fields. Further, the adopted resource CR acts as a marker for, and stores the state for, the adoption reconciler to create a new ACK resource which the controller then reconciles as if it were previously created.

I don't believe the removal policy needs anything as sophisticated as this. This proposal only addresses whether the controller will or will not call Delete* in the SDK before it deletes the finalizer. We do not need to modify any elements of the spec or metadata (apart from this single annotation) on the ACK resource. The ACK resource is also immediately deleted after the delete operation, so having a remaining "Unadoption" custom resource would essentially only act as a log.

Your kubectl unadopt alias could easily be made, and would be as simple as:

```shell
kubectl annotate bucket my-bucket services.k8s.aws/deletion-policy=retain
kubectl delete bucket my-bucket
```

Member

@vijtrip2 vijtrip2 left a comment

I like this proposal. Great job!
I like how you broke the problem into two main parts, and referenced similar issues from other communities too.

I just left some clarifying questions and small comments.

Collaborator

@jaypipes jaypipes left a comment

I'm still not in favour of any of this feature as described in this proposal. The only thing I would be in favour of is something equivalent to an explicit kubectl unadopt command.

The absolute last UI we should be emulating is CloudFormation, IMHO.


When deleting an ACK custom resource, the underlying AWS resource is first deleted. Users should expect this to be the default behaviour, as creating and updating a CR similarly trigger the corresponding actions in AWS. However, users have requested that some resources should continue to live on after the ACK CR has been deleted. Resources with stateful data, such as S3 buckets, RDS instances or EBS volumes, may need to be retained so that they can be migrated between ACK installations, or off ACK entirely, without requiring that they be backed up, deleted and then restored.

Similar to the CDK [`RemovalPolicy` type](https://docs.aws.amazon.com/cdk/api/v2//docs/aws-cdk-lib.RemovalPolicy.html#members) and the native [K8s PV reclaim policy](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming), ACK should support the ability to leave the AWS resource alive while still deleting the custom resource object.

Collaborator

I'll be honest here... any time I see a reference to CloudFormation's (this is CDK, yes, but it's all CloudFormation...) DeletionPolicy, I shudder. I really despise the entire user experience of configuring some tool to behave in different ways when the user takes a specific action (the action in this case is the DELETE action). This is a non-obvious interface and I am opposed to it. Instead, we should encourage a user experience that requires the user to explicitly call a different action in order to have the tool behave differently.

Collaborator

After chatting with @RedbackThomson a bit more about this, I'm going to take a slightly less rigid approach. The Kubernetes API unfortunately does not have a "control channel" through which to send instructions to a controller about how to handle a particular action. Instead, we have the metadata.annotations Hammer, which is unfortunate IMHO because the annotation lives on the resource and not where it should be: as decorative data on the request.

Alas, the metadata.annotations Hammer is all we have so I will reconsider this proposal in light of that and see if I can come up with a variation that doesn't offend my sense of poor coupling.

@ack-bot
Collaborator

ack-bot commented May 17, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale

@ack-bot ack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2022
@RedbackThomson
Contributor Author

/remove-lifecycle stale

@ack-bot ack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@aws-controllers-k8s aws-controllers-k8s deleted a comment from ack-bot May 18, 2022
@bwhaley

bwhaley commented Jul 13, 2022

Having read the proposal and the comments, I can see that there are some strong opinions and fierce opposition to this feature, so perhaps the point is moot. Nevertheless, I'd like to weigh in for some additional perspective since my team is starting to evaluate ACK and this was one area of concern.

Many teams, ours included, use k8s only for stateless applications. Losing a deployment/service/configmap due to an erroneous merge is easy to rollback and no permanent damage is done. Losing an RDS instance or S3 bucket, OTOH, could be a disaster from which there is no recovery. For that reason, some kind of protection would be welcomed.

Additionally, when using managed Kubernetes services like EKS in which the customer has no control plane access, there's a certain nervous tic that something could go wrong in etcd completely outside of customer control. What if a weird bug in EKS caused objects in etcd to disappear and, as a result, the underlying AWS resources were deleted? Some of the protections mentioned in the proposal might protect against this eventuality (e.g. a policy on the controller). Perhaps this is highly unlikely/nigh impossible, but is it fair to write off this worry entirely? We've already experienced unexplainable behavior with etcd on EKS (not this specific thing) so I think the concern is well founded.

"Unadopting" resources is only useful when removal is planned. It's the unplanned errors that I'd worry about.

I do hope this proposal will remain open for others to offer still more perspectives.

@RedbackThomson
Contributor Author

RedbackThomson commented Jul 14, 2022

> What if a weird bug in EKS caused objects in etcd to disappear and as a result the underlying AWS resources are deleted

Although unlikely, I understand your point. The controller is designed to delete resources when it sees the deletionTimestamp property is set to a non-nil value. Disappearing from etcd is less of a concern, but some other tool setting this value on behalf of an unwilling user is a fair problem.

I do still agree with @jaypipes with respect to not having deletion protection. We would essentially be recreating RBAC in a very naive way, introducing yet another mechanism for people to configure permissions on top of K8s RBAC and AWS IAM. Those two systems are far better suited for that style of fine-grained access control. If you really, really wanted to make sure your production systems can't be deleted by an ACK controller, you could even modify the IAM role associated with the controller and deny all Delete:* permissions.

However, the proposal for adding a retain setting on resources I think still stands. There may be situations where you'd like to migrate stateful applications between ACK-enabled clusters. Setting retain on each of your stateful application infrastructure elements, deleting them from the cluster and then adopting them into the new cluster would greatly reduce the chance of accidental deletion. Furthermore, proactively setting retain on stateful resources would protect those resources from being accidentally deleted by someone who doesn't realise they're in the wrong cluster context.
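The migration flow sketched here could look like the following (the resource names are illustrative, and the `AdoptedResource` fields are assumed from the existing ACK adoption feature rather than specified by this proposal):

```yaml
# Old cluster: mark the bucket "retain", then delete the CR; the AWS bucket survives.
#   kubectl annotate bucket my-data services.k8s.aws/deletion-policy=retain
#   kubectl delete bucket my-data
#
# New cluster: explicitly re-adopt the surviving AWS resource.
apiVersion: services.k8s.aws/v1alpha1
kind: AdoptedResource
metadata:
  name: adopt-my-data
spec:
  aws:
    nameOrID: my-data   # the existing S3 bucket, left intact by "retain"
  kubernetes:
    group: s3.services.k8s.aws
    kind: Bucket
```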

This proposal will remain open until either someone decides to act on it - doesn't have to be the ACK core team, this feature is open for anyone to take a crack at - or until it is dismissed for one reason or another.

@bwhaley

bwhaley commented Jul 14, 2022

> We would essentially be recreating RBAC in a very naive way, introducing yet another mechanism for people to configure permissions on top of K8s RBAC and AWS IAM.

AFAICT this proposal or any resource protection feature doesn't resemble RBAC in the least. There isn't anything about access control or permissions. Rather, this is about how to decide whether a resource should be deleted.

> However, the proposal for adding a retain setting on resources I think still stands

Interesting. So you're suggesting something like (though not exactly) the following:

`s3.services.k8s.aws/deletion-policy: retain`

for any individual resource, but not for an entire controller or namespace? Am I understanding your point correctly?

If so, that's great, but that smells like a deletion policy to me.

@RedbackThomson
Contributor Author

> AFAICT this proposal or any resource protection feature doesn't resemble RBAC in the least. There isn't anything about access control or permissions. Rather, this is about how to decide whether a resource should be deleted.

Oh, actually, I removed that from the proposal altogether. That's what the majority of the disagreement was about - designing an annotation that prevents deletion of a given resource. Nevermind, then.

> for any individual resource, but not for an entire controller or namespace

Correct. This is the current intention for this version of the proposal.

> smells like a deletion policy to me

Well it was borrowed from the CloudFormation DeletionPolicy resource property, so that's not a coincidence.

@bwhaley

bwhaley commented Jul 14, 2022

Got it! I think I was missing all the RBAC context. Now it makes more sense.

> for any individual resource, but not for an entire controller or namespace
>
> Correct. This is the current intention for this version of the proposal.

Nice, seems really clean then.

> Well it was borrowed from the CloudFormation DeletionPolicy resource property, so that's not a coincidence.

FWIW, Terraform uses the term "prevent destroy" which also seems apropos to this proposal.

@eks-bot

eks-bot commented Oct 12, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale

@ack-bot ack-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 12, 2022
@a-hilaly
Member

/lifecycle frozen

@ack-bot
Collaborator

ack-bot commented Oct 27, 2022

@a-hilaly: The lifecycle/frozen label cannot be applied to Pull Requests.

In response to this:

> /lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@a-hilaly
Member

/remove-lifecycle stale

@ack-bot ack-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2022

Member

@a-hilaly a-hilaly left a comment

Very clear 👍 can't wait to see this in action

Comment on lines +86 to +87
```diff
+ - --deletion-policy
+ - "retain"
```
Member

For the drift remediation feature, we went for resource-level configuration. On top of the --deletion-policy flag, do we want to add another flag to provide a similar experience for deletion protection? Thinking:

```shell
--resource-deletion-policy Function=Retain \
--resource-deletion-policy EventSourceMappings=Retain
```

Contributor Author

I think I actually have some issues with the resource level configuration of drift remediation, in that it's a non-standard format and seems a little bit convoluted. But I'll leave those comments on your PR.

For now I'll stick to having a global deletion policy flag, and leave the door open to more granular CLI flags later.

```yaml
  name: retained-bucket
```

## Backwards Compatibility and Risks

Member

This might be out of scope for this proposal: currently, resources are deleted right after a DELETE API call is successful. Some resources still hang around for multiple minutes, or even days, e.g. RDS::DBInstance, CloudTrail::EDS... A few open questions:

  • Do we still want to continue having the same user experience in the future?
  • If a user sets the deletion policy to Delete, do we want to keep the CR alive, reflecting the real state of the resource?

Contributor Author

> If a user sets the deletion policy to Delete, do we want to keep the CR alive reflecting the real state of a resource?

A deletion policy of Delete should be identical to the current implementation. A deletion policy of Retain will immediately remove the CR from K8s (since it takes no synchronous action on the underlying AWS resource).

@jljaco
Member

jljaco commented Jan 9, 2023

/lgtm

@ack-bot ack-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 9, 2023
@ack-bot
Collaborator

ack-bot commented Jan 9, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jljaco, RedbackThomson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [RedbackThomson,jljaco]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-bot ack-bot merged commit e548b73 into aws-controllers-k8s:main Jan 9, 2023
@RedbackThomson RedbackThomson deleted the delete-proposal branch January 9, 2023 23:50
ack-bot pushed a commit to aws-controllers-k8s/runtime that referenced this pull request Jan 10, 2023
Issue #, if available: aws-controllers-k8s/community#82

Description of changes:
As per the proposal: aws-controllers-k8s/community#1148

Adds support for a new optional CLI argument for the controller binary, `--deletion-policy`, supporting `delete` and `retain` as available options. Setting this argument to `delete` keeps the controller's current default behaviour, whereby it deletes the AWS resource under management before the CR is removed from the K8s API server. Setting this argument to `retain` modifies the default behaviour, leaving the AWS resource intact (taking no action on it) before removing the CR from the K8s API server.

Adds support for a new annotation on `Namespace` objects: `{service}.services.k8s.aws/deletion-policy: delete/retain` (where `{service}` is the service alias of the controller, e.g. `s3`). This annotation overrides the deletion policy behaviour configured for the controller, for all resources deployed under the namespace.

Adds support for a new annotation on any ACK resource object: `services.k8s.aws/deletion-policy: delete/retain`. This annotation overrides only the deletion policy behaviour configured for the resource.
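Taken together, the levels described above layer from least to most specific: controller flag, then namespace annotation, then resource annotation. A sketch of the two annotation levels, assuming the s3 controller (the resource names are illustrative):

```yaml
# Namespace-level: every s3 resource in this namespace defaults to "retain".
apiVersion: v1
kind: Namespace
metadata:
  name: production
  annotations:
    s3.services.k8s.aws/deletion-policy: retain
---
# Resource-level: this one bucket opts back in to normal deletion.
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: scratch-bucket
  namespace: production
  annotations:
    services.k8s.aws/deletion-policy: delete
spec:
  name: scratch-bucket
```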

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.