
Mirror PersistentVolume ReclaimPolicy semantics #87

Merged: 1 commit, Nov 19, 2019

Conversation

@negz (Member) commented Nov 19, 2019

Description of your changes

Fixes #21

This commit changes the meaning of the managed resource reclaim policy to match https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming as closely as possible, minus the deprecated 'Recycle' policy. Previously the reclaim policy dictated only what happened to the external resource when its managed resource was deleted.

Note that the implementation here differs slightly from some of our discussions around the reclaim policy, but I believe that is due to our having misunderstood how the persistent volume reclaim policy that we wanted to mirror works. Based on my read of the above link, the behaviour I intend for this PR is:

  • Managed resources using the Delete policy are deleted when their bound resource claim is deleted.
  • Managed resources using the Delete policy delete their external resource when they are deleted.
  • Managed resources using the Retain policy are not deleted when their bound resource claim is deleted.
  • Managed resources using the Retain policy do not delete their external resource when they are deleted.
  • All managed resources transition to binding phase Released (not Unbound) when their resource claim is deleted.
  • A Released managed resource cannot be reused; it sticks around only so the infrastructure operator can perform any cleanup tasks they need to before manually deleting it.
  • A managed resource without an explicit reclaim policy is treated as having the Retain policy.
  • A resource class without a reclaim policy will dynamically provision managed resources with the Delete reclaim policy.
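The rules above reduce to two independent decisions keyed on the reclaim policy: what happens to the managed resource when its claim is deleted, and what happens to the external resource when the managed resource is deleted. The following is a minimal sketch of that decision table, not the actual crossplane-runtime code; the type and function names are illustrative:

```go
package main

import "fmt"

// ReclaimPolicy mirrors the PersistentVolume reclaim policy semantics
// described above. These names are illustrative stand-ins, not the actual
// Crossplane API types.
type ReclaimPolicy string

const (
	ReclaimDelete ReclaimPolicy = "Delete"
	ReclaimRetain ReclaimPolicy = "Retain"
)

// deleteManagedOnClaimDeletion reports whether a managed resource should be
// deleted when its bound resource claim is deleted. An unset policy behaves
// as Retain.
func deleteManagedOnClaimDeletion(p ReclaimPolicy) bool {
	return p == ReclaimDelete
}

// deleteExternalOnManagedDeletion reports whether the external resource
// should be deleted when its managed resource is deleted. An unset policy
// again behaves as Retain.
func deleteExternalOnManagedDeletion(p ReclaimPolicy) bool {
	return p == ReclaimDelete
}

func main() {
	for _, p := range []ReclaimPolicy{ReclaimDelete, ReclaimRetain, ""} {
		fmt.Printf("policy=%q deleteManaged=%t deleteExternal=%t\n",
			p, deleteManagedOnClaimDeletion(p), deleteExternalOnManagedDeletion(p))
	}
}
```

Note that both decisions follow the same policy value; this PR deliberately does not introduce a second knob for the external resource.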

Testing Checklist

Before merging this PR I intend to:

  • Ensure statically provisioned managed resources default to Retain by statically provisioning a managed resource without a reclaim policy and observing that its reclaim policy remains unset (and is thus treated as Retain).
  • Ensure dynamically provisioned managed resources default to Delete by dynamically provisioning a managed resource using a resource class without a reclaim policy and observing that the managed resource's reclaim policy is Delete.
  • Ensure managed resources with policy Retain (or unset) are retained in binding phase Released when their claim is deleted.
  • Ensure managed resources with policy Retain (or unset) retain their external resources when deleted.
  • Ensure managed resources with policy Delete are deleted when their claim is deleted.
  • Ensure managed resources with policy Delete delete their external resources when deleted.

Checklist

I have:

  • Run make reviewable to ensure this PR is ready for review.
  • Ensured this PR contains a neat, self-documenting set of commits.
  • Updated any relevant documentation, examples, or release notes.
  • Updated the RBAC permissions in clusterrole.yaml to include any new types.

@negz negz requested review from muvaf and hasheddan November 19, 2019 06:52
@upbound-bot (Collaborator): 92% (+0.07%) vs master 92%

@upbound-bot (Collaborator): 92% (-0.15%) vs master 92%

@upbound-bot (Collaborator): 92% (-0.15%) vs master 92%

@upbound-bot (Collaborator): 92% (-0.14%) vs master 92%

@muvaf (Member) left a comment

With this change, the behavior of ReclaimPolicy is extended to cover the relation between claim and managed, too. IMO, this is a public interface change, which would require a version bump on CRDs. However, we know that this is not the final destination we want to arrive at, per #21 and #27. At some point, we want to separate the external resource policy from the reclaim policy, which would then again be a breaking change. So, I think we should just implement the external resource policy as well and then do the bump only once in this release.

Side note; for v1alpha* resources this may not be as important but for v1beta1 resources I do think that behavior change requires a bump to v1beta2.

// Unbind the supplied Claim from the supplied Managed resource by removing the
// managed resource's claim reference, transitioning it to binding phase
// "Released", and if the managed resource's reclaim policy is "Delete",
// deleting it.
func (a *APIStatusBinder) Unbind(ctx context.Context, _ Claim, mg Managed) error {
Member:
I don't think we should call this Unbind anymore. Release maybe?

@negz (Member, Author):
Release might be a slightly better name, but I think Unbind is still decent, and I would thus prefer to avoid the API churn of renaming the method.

Member:
Yeah I see that, but still I'd expect the resource to be in Unbound state when I call Unbind

pkg/resource/resource.go (comment resolved)
apis/core/v1alpha1/resource.go (comment resolved)
@negz (Member, Author) commented Nov 19, 2019

With this change, the behavior of ReclaimPolicy is extended to cover the relation between claim and managed, too. IMO, this is a public interface change, which would require a version bump on CRDs.

The Kubernetes API versioning guidance is not super explicit about this. For beta resources it states:

The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release.

As opposed to alpha APIs, which state:

The API may change in incompatible ways in a later software release without notice.

If a subsequent beta or stable release is interpreted to refer to the custom resource and not the software and we deem this semantic/behaviour change to be incompatible with the previous behaviour, then I agree that this warrants a version bump for beta resources.

However, we know that this is not the final destination we want to arrive at, per #21 and #27. At some point, we want to separate the external resource policy from the reclaim policy, which would then again be a breaking change.

I don't think this is true anymore. I'm much less confident these days that the externalResourcePolicy suggestion I included in #21 and #27 is a good direction. It was informed by my wanting to mirror the semantics of the persistent volume reclaim policy, but misunderstanding that said policy actually controls the relationship between claim and volume and between volume and external resource. @hasheddan spotted that a while back per #21 (comment) but I didn't grok it until today.

Presuming we still want to mirror the persistent volume semantics, then we don't need to (or indeed want to) introduce a second setting that determines whether external resources are deleted along with their managed resources; that is covered by reclaimPolicy. This leaves us potentially needing:

  1. A setting that determines what we do when we want to create a managed resource that already exists.
  2. A setting that determines what we do when an external resource's configuration drifts from its managed resource's configuration.

I'm quite strongly convinced these days that Crossplane should always be the source of truth for the managed resources it manages. For the above two potential configuration settings this means:

  1. The options are either to return an error or adopt the external resource into Crossplane management. If we adopt the external resource we immediately start driving it to our desired state as specified by the spec of the managed resource that adopted it.
  2. Should not be configurable. Crossplane is always the source of truth, and we always drive the external resource toward our source of truth when it diverges.

In summary, I feel we should move forward with this PR and don't expect to change how reclaimPolicy works again anytime soon. I do think we'll still want to introduce something like a .spec.adoptionPolicy that controls whether we return an error or adopt an external resource we encounter when trying to create a managed resource. Whether we do that now or later it belongs in a separate PR.
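The adoption behaviour proposed above could be sketched as follows. Note that .spec.adoptionPolicy is explicitly hypothetical (it does not exist in this PR or in the Crossplane API at this point), and every name here is illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// AdoptionPolicy is hypothetical: it is only floated in the discussion above
// and is not part of this PR; all names here are illustrative.
type AdoptionPolicy string

const (
	AdoptionAdopt AdoptionPolicy = "Adopt"
	AdoptionError AdoptionPolicy = "Error"
)

var errAlreadyExists = errors.New("external resource already exists")

// onExistingExternal models the two options discussed: return an error, or
// adopt the external resource and start driving it toward the managed
// resource's spec (Crossplane remains the source of truth either way).
func onExistingExternal(p AdoptionPolicy) (adopted bool, err error) {
	if p == AdoptionAdopt {
		return true, nil
	}
	return false, errAlreadyExists
}

func main() {
	adopted, err := onExistingExternal(AdoptionError)
	fmt.Println(adopted, err) // prints: false external resource already exists
}
```

Drift, by contrast, gets no policy knob at all in this proposal: the controller always reconciles the external resource back toward the managed resource's spec.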

@muvaf (Member) commented Nov 19, 2019

If a subsequent beta or stable release is interpreted to refer to the custom resource and not the software and we deem this semantic/behaviour change to be incompatible with the previous behaviour, then I agree that this warrants a version bump for beta resources.

I think this is indeed the case. Quoting from the versioning guideline:

We chose to version at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-life and/or experimental APIs.
...
Note that API versioning and Software versioning are only indirectly related.

If it was a new field, I'd say we don't need the bump. But it's clear to me that the user gets a different behavior now when they choose Retain policy. Before, we'd delete the managed resource CR when a claim is deleted no matter what, but now we don't.

I feel like this is more of a product direction discussion rather than technical. I don't want to block this PR. Let's discuss it under #22

@muvaf (Member) commented Nov 19, 2019

I am not sure about the fact that when reclaim policy is set to Retain, even though the resource is retained after claim deletion, it cannot be bound to another claim. I know this is how it works in PVC/PV model but I think the cloud resources are not as application-specific/low-level as volumes. It seems like fair game to me for another application to claim a managed resource and use it.

@hasheddan (Member) left a comment

I am good with this change as an incremental step and feel as though it is in-line with the design proposed. I do think working on importing resources / using existing resources should be prioritized, but not tightly coupled to this change.

@hasheddan (Member):

Note: this will also require a slight change to core Crossplane as the ResourceClaim struct embeds BindingStatus. However, the claims would never be set to Released so I am not sure how urgent the change would be.

@hasheddan (Member):

@muvaf once we solve #22 then this could still be accomplished. The retained managed resources could be manually cleaned up, then the remaining external resources could be "imported" and used by another claim.

@muvaf (Member) left a comment

I am not sure about the fact that when reclaim policy is set to Retain, even though the resource is retained after claim deletion, it cannot be bound to another claim.

Discussed with @negz offline, and now I think deferring this discussion for another time sounds like the best path to move forward.

In short: I believe managed resources should be re-schedulable in some way. Things like volumes or databases can be considered one-application resources, where it doesn't make a lot of sense for another application to take over, but there are stateless cloud services and various scenarios where the user may want to re-use a resource after it's been released by a claim. In the future, we can introduce a ReclaimPolicy option to cover this, and doing so when users ask for it makes more sense.

The resource import story will be covered in #22.


@negz (Member, Author) commented Nov 19, 2019

Before, we'd delete the managed resource CR when a claim is deleted no matter what, but now we don't.

A small clarification - we'd delete the managed resource CR by default, but not no matter what. Whether or not the managed resource was deleted was determined by the cascading deletion policy, so kubectl delete --cascade=false someclaim would not delete the managed resource.

I am not sure about the fact that when reclaim policy is set to Retain, even though the resource is retained after claim deletion, it cannot be bound to another claim. I know this is how it works in PVC/PV model but I think the cloud resources are not as application-specific/low-level as volumes. It seems like fair game to me for another application to claim a managed resource and use it.

We discussed this on a call today, but I want to capture my (somewhat rambling) thoughts here too. I believe the following to be true with or without the change proposed by this PR:

  • Persistent volume claims and resource claims can be consumed by multiple pods simultaneously.
  • Persistent volume claims and resource claims can be consumed by multiple pods serially.

What changes with this PR is whether a managed resource can be claimed (which is distinct from used) multiple times serially, which brings us to philosophising about what is a claim? I see managed resources (and classes) as being "offerings" published by an infrastructure operator, and resource claims being just that - "claims" to one of those offerings by a particular entity. So an entity can claim a managed resource in a particular namespace, and that claim can be used by one or more entities within its namespace.

At some point the entity who claimed the managed resource will not want it any more. I feel that a claim being deleted means "One or more entities were using this resource for a particular purpose or set of purposes. We no longer need this resource for that purpose or set of purposes."

The question then is whether it's safe and sensible to offer the same underlying resource to another claimant, potentially in another namespace. I don't think it is a safe and sensible default to then allow another entity to claim that managed resource for a different set of purposes, given that by doing so they'll potentially gain access to any state that was left over by the previous claimant.

Put otherwise - "a big database instance" is a managed resource, while "the billing database instance" is a resource claim. By sharing or reusing at the resource claim level you're allowing multiple entities to use "the billing database". By reusing at the managed resource level you're allowing multiple entities to use "a big database instance", without consideration of the fact that the big database used to be the billing database instance.

Keep in mind the reclaim policy is set by the infrastructure operator, not the claimant, so the infrastructure operator is deciding what happens to it when it's done. The cases we support here are:

  • Delete - No cleanup needed, just delete the thing and be done with it.
  • Retain - I need or want to manually clean this thing up. Keep it around until I delete it manually. Maybe I want to ensure there's a backup of an SQL database, or want to retain a globally unique bucket name, or want to manually delete LoadBalancer type services from an EKS cluster so their ELBs are not orphaned when the cluster is deleted.

Persistent volumes support a deprecated Recycle policy that attempts to clean up a previously claimed PV and let others claim it. I'm guessing this was deprecated because it's difficult to securely erase volumes with arbitrary underlying storage technologies, and because there's just not that much point recycling a volume when you could instead treat it as cattle; delete it and dynamically provision an identical new one. I suspect these are both as or more true for our managed resources.

This commit changes the meaning of the managed resource reclaim policy to match
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming as
closely as possible, minus the deprecated 'Recycle' policy. Previously the
reclaim policy dictated only what happened to the external resource when its
managed resource was deleted.

Signed-off-by: Nic Cope <negz@rk0n.org>
@upbound-bot (Collaborator): 92% (-0.14%) vs master 92%

@negz negz merged commit 3706665 into crossplane:master Nov 19, 2019
@negz negz deleted the reclaim branch November 19, 2019 22:48
@muvaf (Member) commented Nov 20, 2019

I don't think it is a safe and sensible default to then allow another entity to claim that managed resource for a different set of purposes, given that by doing so they'll potentially gain access to any state that was left over by the previous claimant.

I agree with this statement. It definitely shouldn't be the default. But I also see the value in making this possible. Opened #88 to track this discussion.

Labels: release/backwards-compatible, release/behaviour-change (this PR will change controller behaviour when released)

Successfully merging this pull request may close these issues: Implement updated reclaim policy strategy

4 participants