Semantics for destructive operations design proposal #1148
Conversation
I am opposed to the entire "deletion protection" feature, for reasons stated inline in my comments.
I would be fine with an "unadopt resource" or "unmanage resource" explicit action and user experience, however.
Please rename this proposal to "Deletion protection", because I'd like to see a separate proposal on cascading delete functionality (which I consider to be more aligned with the declarative configuration management principles that ACK should be encouraging).
* Helm uses an annotation: `helm.sh/resource-policy: keep`
* Google Config Connector uses an annotation: `cnrm.cloud.google.com/deletion-policy: abandon`
* Skaffold uses a label: `skaffold.dev/cleanup: "false"`
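All three tools attach the opt-out marker directly to object metadata. For instance, a Helm-managed object that should survive `helm uninstall` carries the annotation like this (the object and its name are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: keep-me                       # illustrative object name
  annotations:
    helm.sh/resource-policy: keep     # Helm skips this object on uninstall
data:
  example: "value"
```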
There were also proposals originally to use an annotation/label for our resource adoption functionality. Instead of putting the annotation/label on a CR, we chose a user experience where the user needs to explicitly adopt an AWS resource by creating a `services.k8s.aws/AdoptedResource` CR. I continue to believe that we made the right decision in this regard: an explicit adoption action (via the `kubectl create -f <services.k8s.aws/AdoptedResource>` process) is the least surprising, hardest-to-misuse interface.

For the reverse of the adoption action, I would prefer a similarly explicit action, essentially an "unadopt" or "unmanage" action. We could do this in a couple of ways:
- Create a new kubectl plugin/verb called `kubectl unmanage` that the user could call to delete the CR but keep the AWS resource around
- Have the user run `kubectl delete services.k8s.aws/AdoptedResource/<resource arn>` and have the AdoptedResourceReconciler run some "delete the CR but keep the AWS resource around" code
#### Examples
Definition at the controller - all resources reconciled by this controller will use the “retain” policy (unless overridden by an annotation):
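The example itself is collapsed in the resolved thread above; a controller-level definition would presumably be a Deployment args fragment along these lines (flag name taken from this proposal, the other fields are illustrative):

```yaml
# Illustrative fragment of the controller Deployment spec
spec:
  containers:
  - name: controller
    args:
    - --deletion-policy   # flag proposed in this document
    - retain              # all resources reconciled by this controller default to "retain"
```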
I do not think "retain" should be the default behaviour. We can keep the current implementation as the default, i.e. deleting a resource cleans things up on the AWS side as well. In my opinion this matches the default expectation of most users.

An option to override this via the CLI for production use, stateful resources and databases will give users the ability to prevent accidental deletion.

Another, less popular, option would be to get opinionated about AWS service controllers and store metadata about which service controllers have "retain" as the default option. Then Helm/Kustomize artifacts would only set the deletion policy to "retain" for those service controllers. However, this brings inconsistency among service controllers by default.
> I do not think "retain" should be the default use-case

"delete" is the default. All default ACK functionality will remain the same under every part of this proposal. Only in this particular example do we override the default and set it to "retain" at the command line argument level.

> And option to override this cli-option for production use, stateful resources and databases will give users ability to prevent accidental deletion.

That is what this example is attempting to show.
I actually do agree with this. I wanted to write this proposal to get critical feedback like this. As I was writing the proposal, I felt that we would be overengineering the functionality. However, I still felt it was important to explore the technical challenges involved, just so that we could all see the pros and cons of the feature.

I will be removing "Deletion protection" from this proposal - I might add an appendix where I mention this discussion, to keep it for posterity. I will pivot this proposal into discussing only the "AWS Removal Policy" functionality. On that front:
The reason we chose to have a separate CRD for adoption was mainly that we could not fit enough information into the existing CRD (or its metadata) of the ACK resources, or we would have had to significantly redesign the optional/required fields. Further, the adopted resource CR acts as a marker for, and stores the state for, the adoption reconciler to create a new ACK resource which the controller then reconciles as if it had been created previously. I don't believe the removal policy needs anything as sophisticated as this. This proposal only addresses whether or not the controller will call the delete action on the underlying AWS resource. Your example:

```
kubectl annotate bucket my-bucket services.k8s.aws/deletion-policy=retain
kubectl delete bucket my-bucket
```
I like this proposal. Great job!
I like how you broke the problem into two main parts, and referenced similar issues from other communities too.
I just left some clarifying questions and small comments.
I'm still not in favour of any of this feature as described in this proposal. The only thing I would be in favour of is something equivalent to an explicit `kubectl unadopt` command.

The absolute last UI we should be emulating is CloudFormation, IMHO.
When deleting an ACK custom resource, the underlying AWS resource is deleted first. Users should expect this to be the default behaviour, as creating and updating a CR similarly trigger the corresponding actions in AWS. However, users have requested that some resources continue to live on after the ACK CR has been deleted. Resources with stateful data, such as S3 buckets, RDS instances or EBS volumes, may need to be retained so that they can be migrated between ACK installations, or off ACK entirely, without requiring that they be backed up, deleted and then restored.
Similar to the CDK [`RemovalPolicy` type](https://docs.aws.amazon.com/cdk/api/v2//docs/aws-cdk-lib.RemovalPolicy.html#members) and the native [K8s PV reclaim policy](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming), ACK should support the ability to leave the AWS resource alive while still deleting the custom resource object.
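For reference, the PV analogue mentioned here is the `persistentVolumeReclaimPolicy` field; a PV marked `Retain` survives deletion of its claim (the name and volume source below are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv                          # illustrative name
spec:
  persistentVolumeReclaimPolicy: Retain     # volume is kept after its claim is deleted
  capacity:
    storage: 1Gi
  accessModes:
  - ReadWriteOnce
  hostPath:                                 # illustrative volume source
    path: /tmp/example
```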
I'll be honest here... any time I see a reference to CloudFormation's DeletionPolicy (this is CDK, yes, but it's all CloudFormation...), I shudder. I really despise the entire user experience of configuring some tool to behave in different ways when the user takes a specific action (the action in this case being DELETE). This is a non-obvious interface and I am opposed to it. Instead, we should encourage a user experience that requires the user to explicitly call a different action in order to have the tool behave differently.
After chatting with @RedbackThomson a bit more about this, I'm going to take a slightly less rigid approach. The Kubernetes API unfortunately does not have a "control channel" through which to send instructions to a controller about how to handle a particular action. Instead, we have the `metadata.annotations` hammer, which is unfortunate IMHO because it lives on the resource and not where it should be: as decorative data on the request.

Alas, the `metadata.annotations` hammer is all we have, so I will reconsider this proposal in light of that and see if I can come up with a variation that doesn't offend my sense of poor coupling.
Issues go stale after 90d of inactivity.

/remove-lifecycle stale
Having read the proposal and the comments, I can see that there are some strong opinions and fierce opposition to this feature, so perhaps the point is moot. Nevertheless, I'd like to weigh in for some additional perspective, since my team is starting to evaluate ACK and this was one area of concern.

Many teams, ours included, use k8s only for stateless applications. Losing a deployment/service/configmap due to an erroneous merge is easy to roll back and no permanent damage is done. Losing an RDS instance or S3 bucket, OTOH, could be a disaster from which there is no recovery. For that reason, some kind of protection would be welcomed.

Additionally, when using managed Kubernetes services like EKS in which the customer has no control plane access, there's a certain nervous tic that something could go wrong in etcd completely outside of customer control. What if a weird bug in EKS caused objects in etcd to disappear, and as a result the underlying AWS resources were deleted? Some of the protections mentioned in the proposal might protect from this eventuality (e.g. a policy on the controller). Perhaps this is highly unlikely/nigh impossible, but is it fair to write off this worry entirely? We've already experienced unexplainable behavior with etcd on EKS (not this specific thing), so I think the concern is well founded.

"Unadopting" resources is only useful when removal is planned. It's the unplanned errors that I'd worry about. I do hope this proposal will remain open for others to offer still more perspectives.
Although unlikely, I understand your point. The controller is designed to delete AWS resources when it sees the corresponding custom resource deleted.

I do still agree with @jaypipes with respect to not having deletion protection. We would essentially be recreating RBAC in a very naive way, introducing yet another mechanism for people to configure permissions on top of K8s RBAC and AWS IAM. Those two systems are far better suited for that style of fine-grained access control. If you really, really wanted to make sure your production systems can't be deleted by an ACK controller, you could even modify the IAM role associated with the controller and deny all delete actions.

However, the proposal for adding a deletion policy would still protect against the scenario you describe, since the controller would take no destructive action on the underlying AWS resource.

This proposal will remain open until either someone decides to act on it - it doesn't have to be the ACK core team, this feature is open for anyone to take a crack at - or until it is dismissed for one reason or another.
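The IAM-based guard mentioned here could be an explicit deny statement attached to the controller's role. A minimal sketch for an S3 controller (the statement ID and action list are illustrative, not exhaustive):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BlockACKDeletes",
      "Effect": "Deny",
      "Action": ["s3:DeleteBucket"],
      "Resource": "*"
    }
  ]
}
```

Because an explicit deny always wins over any allow in IAM policy evaluation, the controller's delete calls would fail even if its base policy grants broad S3 access.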
AFAICT this proposal or any resource protection feature doesn't resemble RBAC in the least. There isn't anything about access control or permissions. Rather, this is about how to decide whether a resource should be deleted.
Interesting. So you're suggesting something like (though not exactly) the annotation shown above for any individual resource, but not for an entire controller or namespace? Am I understanding your point correctly? If so, that's great, but that smells like a deletion policy to me.
Oh, actually, I removed that from the proposal altogether. That's what the majority of the disagreement was about - designing an annotation that prevents deletion of a given resource. Never mind, then.

Correct. This is the current intention for this version of the proposal.

Well, it was borrowed from the CloudFormation `DeletionPolicy`.
Got it! I think I was missing all the RBAC context. Now it makes more sense.
Nice, seems really clean then.
FWIW, Terraform uses the term "prevent destroy", which also seems apropos to this proposal.
Issues go stale after 90d of inactivity.

/lifecycle frozen
@a-hilaly: In response to this:

> /lifecycle frozen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-lifecycle stale
Very clear 👍 can't wait to see this in action
```diff
+ - --deletion-policy
+ - "retain"
```
For the drift remediation feature, we went with resource-level configuration. On top of the `--deletion-policy` flag, do we want to add another flag to provide a similar experience for deletion protection? Thinking:

```
--resource-deletion-policy Function=Retain \
--resource-deletion-policy EventSourceMappings=Retain
```
I think I actually have some issues with the resource level configuration of drift remediation, in that it's a non-standard format and seems a little bit convoluted. But I'll leave those comments on your PR.
For now I'll stick to having a global deletion policy flag, and leave the door open to more granular CLI flags later.
```yaml
  name: retained-bucket
```

## Backwards Compatibility and Risks
This might be out of scope for this proposal: currently, resources are deleted right after a DELETE API call is successful. Some resources still hang around for multiple minutes, or even days, e.g. `RDS::DBInstance`, `CloudTrail::EDS`... A few open questions:

- Do we still want to continue having the same user experience in the future?
- If a user sets the deletion policy to `Delete`, do we want to keep the CR alive, reflecting the real state of the resource?
> If a user sets the deletion policy to `Delete`, do we want to keep the CR alive reflecting the real state of a resource?

A deletion policy of `Delete` should be identical to the current implementation. A deletion policy of `Retain` will immediately remove the CR from K8s (since it takes no synchronous action on the underlying AWS resource).
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jljaco, RedbackThomson

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Issue #, if available: aws-controllers-k8s/community#82

Description of changes: As per the proposal aws-controllers-k8s/community#1148:

Adds support for a new optional CLI argument for the controller binary, `--deletion-policy`, supporting `delete` and `retain` as available options. Setting this argument to `delete` leaves the controller with the current default behaviour, whereby it deletes the AWS resource under management before the CR is removed from the K8s API server. Setting this argument to `retain` modifies the default behaviour, leaving the AWS resource intact (taking no action on it) before removing the CR from the K8s API server.

Adds support for a new annotation on `Namespace` objects: `{service}.services.k8s.aws/deletion-policy: delete/retain` (where `{service}` is the service alias of the controller, e.g. `s3`). This annotation overrides the deletion policy behaviour configured for the controller, for all resources deployed under the namespace.

Adds support for a new annotation on any ACK resource object: `services.k8s.aws/deletion-policy: delete/retain`. This annotation overrides the configured deletion policy behaviour for only that resource.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
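The three configuration levels described in the PR compose into a simple precedence rule: resource annotation over namespace annotation over controller flag. A minimal Go sketch of that resolution (the function name and signature are illustrative, not the actual ACK runtime code):

```go
package main

import "fmt"

// resolveDeletionPolicy sketches the precedence described in the PR
// description: resource annotation > namespace annotation > controller
// flag default. Annotation keys match the proposal.
func resolveDeletionPolicy(service, controllerDefault string, nsAnnotations, resAnnotations map[string]string) string {
	// A per-resource annotation wins over everything else.
	if p, ok := resAnnotations["services.k8s.aws/deletion-policy"]; ok {
		return p
	}
	// A namespace annotation is scoped to one service controller.
	if p, ok := nsAnnotations[service+".services.k8s.aws/deletion-policy"]; ok {
		return p
	}
	// Fall back to the controller-wide --deletion-policy flag value.
	return controllerDefault
}

func main() {
	ns := map[string]string{"s3.services.k8s.aws/deletion-policy": "retain"}
	// Namespace annotation overrides the controller flag; prints "retain".
	fmt.Println(resolveDeletionPolicy("s3", "delete", ns, nil))
}
```

Reading from a nil map is safe in Go, so absent annotation sets simply fall through to the next level.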
Issue #, if available: #82
Description of changes:
Draft proposal for mechanisms to protect users against accidental deletion of ACK and/or AWS resources
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.