
Mirror PersistentVolume ReclaimPolicy semantics #87

Merged: 1 commit, Nov 19, 2019

Conversation

@negz (Member) commented Nov 19, 2019

Description of your changes

Fixes #21

This commit changes the meaning of the managed resource reclaim policy to match https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming as closely as possible, minus the deprecated 'Recycle' policy. Previously the reclaim policy dictated only what happened to the external resource when its managed resource was deleted.

Note that the implementation here differs slightly from some of our discussions around the reclaim policy, but I believe that is due to our having misunderstood how the persistent volume reclaim policy that we wanted to mirror works. Based on my read of the above link, the behaviour I intend for this PR is:

  • Managed resources using the Delete policy are deleted when their bound resource claim is deleted.
  • Managed resources using the Delete policy delete their external resource when they are deleted.
  • Managed resources using the Retain policy are not deleted when their bound resource claim is deleted.
  • Managed resources using the Retain policy do not delete their external resource when they are deleted.
  • All managed resources transition to binding phase Released (not Unbound) when their resource claim is deleted.
  • A Released managed resource cannot be reused; it sticks around only so the infrastructure operator can perform any cleanup tasks they need to before manually deleting it.
  • A managed resource without an explicit reclaim policy is treated as having the Retain policy.
  • A resource class without a reclaim policy will dynamically provision managed resources with the Delete reclaim policy.
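The rules above reduce to two independent decisions keyed on the reclaim policy: what happens to the managed resource when its claim is deleted, and what happens to the external resource when the managed resource is deleted. The following is a minimal sketch of that decision table, not the actual crossplane-runtime code; the type and function names are illustrative:

```go
package main

import "fmt"

// ReclaimPolicy mirrors the PersistentVolume reclaim policy semantics
// described above. These names are illustrative stand-ins, not the actual
// Crossplane API types.
type ReclaimPolicy string

const (
	ReclaimDelete ReclaimPolicy = "Delete"
	ReclaimRetain ReclaimPolicy = "Retain"
)

// deleteManagedOnClaimDeletion reports whether a managed resource should be
// deleted when its bound resource claim is deleted. An unset policy behaves
// as Retain.
func deleteManagedOnClaimDeletion(p ReclaimPolicy) bool {
	return p == ReclaimDelete
}

// deleteExternalOnManagedDeletion reports whether the external resource
// should be deleted when its managed resource is deleted. An unset policy
// again behaves as Retain.
func deleteExternalOnManagedDeletion(p ReclaimPolicy) bool {
	return p == ReclaimDelete
}

func main() {
	for _, p := range []ReclaimPolicy{ReclaimDelete, ReclaimRetain, ""} {
		fmt.Printf("policy=%q deleteManaged=%t deleteExternal=%t\n",
			p, deleteManagedOnClaimDeletion(p), deleteExternalOnManagedDeletion(p))
	}
}
```

Note that both decisions follow the same policy value; this PR deliberately does not introduce a second knob for the external resource.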

Testing Checklist

Before merging this PR I intend to:

  • Ensure statically provisioned managed resources default to Retain by statically provisioning a managed resource without a reclaim policy and observing that its reclaim policy remains unset (and is thus treated as Retain).
  • Ensure dynamically provisioned managed resources default to Delete by dynamically provisioning a managed resource using a resource class without a reclaim policy and observing that the managed resource's reclaim policy is Delete.
  • Ensure managed resources with policy Retain (or unset) are retained in binding phase Released when their claim is deleted.
  • Ensure managed resources with policy Retain (or unset) retain their external resources when deleted.
  • Ensure managed resources with policy Delete are deleted when their claim is deleted.
  • Ensure managed resources with policy Delete delete their external resources when deleted.

Checklist

I have:

  • Run make reviewable to ensure this PR is ready for review.
  • Ensured this PR contains a neat, self-documenting set of commits.
  • Updated any relevant documentation, examples, or release notes.
  • Updated the RBAC permissions in clusterrole.yaml to include any new types.

@negz negz requested review from muvaf and hasheddan November 19, 2019 06:52
@upbound-bot (Collaborator): 92% (+0.07%) vs master 92%

@upbound-bot (Collaborator): 92% (-0.15%) vs master 92%

@upbound-bot (Collaborator): 92% (-0.15%) vs master 92%

@upbound-bot (Collaborator): 92% (-0.14%) vs master 92%

@muvaf (Member) left a comment

With this change, the behavior of ReclaimPolicy is extended to cover the relation between claim and managed, too. IMO, this is a public interface change, which would require a version bump on CRDs. However, we know that this is not the final destination we want to arrive at, per #21 and #27. At some point, we want to separate the external resource policy from the reclaim policy, which would then again be a breaking change. So, I think we should just implement the external resource policy as well and then do the bump only once in this release.

Side note; for v1alpha* resources this may not be as important but for v1beta1 resources I do think that behavior change requires a bump to v1beta2.

// Unbind the supplied Claim from the supplied Managed resource by removing the
// managed resource's claim reference, transitioning it to binding phase
// "Released", and if the managed resource's reclaim policy is "Delete",
// deleting it.
func (a *APIStatusBinder) Unbind(ctx context.Context, _ Claim, mg Managed) error {
Member:
I don't think we should call this Unbind anymore. Release maybe?

@negz (Member, Author):
Release might be a slightly better name, but I think Unbind is still decent, and I would thus prefer to avoid the API churn of renaming the method.

Member:
Yeah I see that, but still I'd expect the resource to be in Unbound state when I call Unbind

pkg/resource/resource.go (comment resolved)
apis/core/v1alpha1/resource.go (comment resolved)
@negz (Member, Author) commented Nov 19, 2019

With this change, the behavior of ReclaimPolicy is extended to cover the relation between claim and managed, too. IMO, this is a public interface change, which would require a version bump on CRDs.

The Kubernetes API versioning guidance is not super explicit about this. For beta resources it states:

The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable release.

As opposed to alpha APIs, which state:

The API may change in incompatible ways in a later software release without notice.

If a subsequent beta or stable release is interpreted to refer to the custom resource and not the software and we deem this semantic/behaviour change to be incompatible with the previous behaviour, then I agree that this warrants a version bump for beta resources.

However, we know that this is not the final destination we want to arrive at, per #21 and #27. At some point, we want to separate the external resource policy from the reclaim policy, which would then again be a breaking change.

I don't think this is true anymore. I'm much less confident these days that the externalResourcePolicy suggestion I included in #21 and #27 is a good direction. It was informed by my wanting to mirror the semantics of the persistent volume reclaim policy, but misunderstanding that said policy actually controls the relationship between claim and volume and between volume and external resource. @hasheddan spotted that a while back per #21 (comment) but I didn't grok it until today.

Presuming we still want to mirror the persistent volume semantics, then we don't need to (or indeed want to) introduce a second setting that determines whether external resources are deleted along with their managed resources; that is covered by reclaimPolicy. This leaves us potentially needing:

  1. A setting that determines what we do when we want to create a managed resource that already exists.
  2. A setting that determines what we do when an external resource's configuration drifts from its managed resource's configuration.

I'm quite strongly convinced these days that Crossplane should always be the source of truth for the managed resources it manages. For the above two potential configuration settings this means:

  1. The options are either to return an error or adopt the external resource into Crossplane management. If we adopt the external resource we immediately start driving it to our desired state as specified by the spec of the managed resource that adopted it.
  2. Should not be configurable. Crossplane is always the source of truth, and we always drive the external resource toward our source of truth when it diverges.

In summary, I feel we should move forward with this PR and don't expect to change how reclaimPolicy works again anytime soon. I do think we'll still want to introduce something like a .spec.adoptionPolicy that controls whether we return an error or adopt an external resource we encounter when trying to create a managed resource. Whether we do that now or later it belongs in a separate PR.
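The adoption behaviour proposed above could be sketched as follows. Note that .spec.adoptionPolicy is explicitly hypothetical (it does not exist in this PR or in the Crossplane API at this point), and every name here is illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// AdoptionPolicy is hypothetical: it is only floated in the discussion above
// and is not part of this PR; all names here are illustrative.
type AdoptionPolicy string

const (
	AdoptionAdopt AdoptionPolicy = "Adopt"
	AdoptionError AdoptionPolicy = "Error"
)

var errAlreadyExists = errors.New("external resource already exists")

// onExistingExternal models the two options discussed: return an error, or
// adopt the external resource and start driving it toward the managed
// resource's spec (Crossplane remains the source of truth either way).
func onExistingExternal(p AdoptionPolicy) (adopted bool, err error) {
	if p == AdoptionAdopt {
		return true, nil
	}
	return false, errAlreadyExists
}

func main() {
	adopted, err := onExistingExternal(AdoptionError)
	fmt.Println(adopted, err) // prints: false external resource already exists
}
```

Drift, by contrast, gets no policy knob at all in this proposal: the controller always reconciles the external resource back toward the managed resource's spec.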

@muvaf (Member) commented Nov 19, 2019

If a subsequent beta or stable release is interpreted to refer to the custom resource and not the software and we deem this semantic/behaviour change to be incompatible with the previous behaviour, then I agree that this warrants a version bump for beta resources.

I think this is indeed the case. Quoting from the versioning guideline:

We chose to version at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-life and/or experimental APIs.
...
Note that API versioning and Software versioning are only indirectly related.

If it was a new field, I'd say we don't need the bump. But it's clear to me that the user gets a different behavior now when they choose Retain policy. Before, we'd delete the managed resource CR when a claim is deleted no matter what, but now we don't.

I feel like this is more of a product direction discussion rather than technical. I don't want to block this PR. Let's discuss it under #22

@muvaf (Member) commented Nov 19, 2019

I am not sure about the fact that when reclaim policy is set to Retain, even though the resource is retained after claim deletion, it cannot be bound to another claim. I know this is how it works in PVC/PV model but I think the cloud resources are not as application-specific/low-level as volumes. It seems like fair game to me for another application to claim a managed resource and use it.

@hasheddan (Member) left a comment

I am good with this change as an incremental step and feel as though it is in-line with the design proposed. I do think working on importing resources / using existing resources should be prioritized, but not tightly coupled to this change.

@hasheddan (Member):

Note: this will also require a slight change to core Crossplane as the ResourceClaim struct embeds BindingStatus. However, the claims would never be set to Released so I am not sure how urgent the change would be.

@hasheddan (Member):

@muvaf once we solve #22 then this could still be accomplished. The retained managed resources could be manually cleaned up, then the remaining external resources could be "imported" and used by another claim.

@muvaf (Member) left a comment

I am not sure about the fact that when reclaim policy is set to Retain, even though the resource is retained after claim deletion, it cannot be bound to another claim.

Discussed with @negz offline, and now I think deferring this discussion for another time sounds like the best path to move forward.

In short: I believe managed resources should be re-schedulable in some way. Things like volumes or databases can be considered one-application resources, where it doesn't make a lot of sense for another application to take over, but there are stateless cloud services and various scenarios where the user may want to re-use a resource after it's been released by a claim. In the future, we can introduce a ReclaimPolicy option to cover this, and doing so when users ask for it makes more sense.

The resource import story will be covered in #22.


@negz (Member, Author) commented Nov 19, 2019

Before, we'd delete the managed resource CR when a claim is deleted no matter what, but now we don't.

A small clarification - we'd delete the managed resource CR by default, but not no matter what. Whether or not the managed resource was deleted was determined by the cascading deletion policy, so kubectl delete --cascade=false someclaim would not delete the managed resource.

I am not sure about the fact that when reclaim policy is set to Retain, even though the resource is retained after claim deletion, it cannot be bound to another claim. I know this is how it works in PVC/PV model but I think the cloud resources are not as application-specific/low-level as volumes. It seems like fair game to me for another application to claim a managed resource and use it.

We discussed this on a call today, but I want to capture my (somewhat rambling) thoughts here too. I believe the following to be true with or without the change proposed by this PR:

  • Persistent volume claims and resource claims can be consumed by multiple pods simultaneously.
  • Persistent volume claims and resource claims can be consumed by multiple pods serially.

What changes with this PR is whether a managed resource can be claimed (which is distinct from used) multiple times serially, which brings us to philosophising about what is a claim? I see managed resources (and classes) as being "offerings" published by an infrastructure operator, and resource claims being just that - "claims" to one of those offerings by a particular entity. So an entity can claim a managed resource in a particular namespace, and that claim can be used by one or more entities within its namespace.

At some point the entity who claimed the managed resource will not want it any more. I feel that a claim being deleted means "One or more entities were using this resource for a particular purpose or set of purposes. We no longer need this resource for that purpose or set of purposes."

The question then is whether it's safe and sensible to offer the same underlying resource to another claimant, potentially in another namespace. I don't think it is a safe and sensible default to then allow another entity to claim that managed resource for a different set of purposes, given that by doing so they'll potentially gain access to any state that was left over by the previous claimant.

Put otherwise - "a big database instance" is a managed resource, while "the billing database instance" is a resource claim. By sharing or reusing at the resource claim level you're allowing multiple entities to use "the billing database". By reusing at the managed resource level you're allowing multiple entities to use "a big database instance", without consideration of the fact that the big database used to be the billing database instance.

Keep in mind the reclaim policy is set by the infrastructure operator, not the claimant, so the infrastructure operator is deciding what happens to it when it's done. The cases we support here are:

  • Delete - No cleanup needed, just delete the thing and be done with it.
  • Retain - I need or want to manually clean this thing up. Keep it around until I delete it manually. Maybe I want to ensure there's a backup of an SQL database, or want to retain a globally unique bucket name, or want to manually delete LoadBalancer type services from an EKS cluster so their ELBs are not orphaned when the cluster is deleted.

Persistent volumes support a deprecated Recycle policy that attempts to clean up a previously claimed PV and let others claim it. I'm guessing this was deprecated because it's difficult to securely erase volumes with arbitrary underlying storage technologies, and because there's just not that much point recycling a volume when you could instead treat it as cattle; delete it and dynamically provision an identical new one. I suspect these are both as or more true for our managed resources.

This commit changes the meaning of the managed resource reclaim policy to match
https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming as
closely as possible, minus the deprecated 'Recycle' policy. Previously the
reclaim policy dictated only what happened to the external resource when its
managed resource was deleted.

Signed-off-by: Nic Cope <negz@rk0n.org>
@upbound-bot (Collaborator): 92% (-0.14%) vs master 92%

@negz negz merged commit 3706665 into crossplane:master Nov 19, 2019
@negz negz deleted the reclaim branch November 19, 2019 22:48
@muvaf (Member) commented Nov 20, 2019

I don't think it is a safe and sensible default to then allow another entity to claim that managed resource for a different set of purposes, given that by doing so they'll potentially gain access to any state that was left over by the previous claimant.

I agree with this statement. It definitely shouldn't be the default. But I also see the value in making this possible. Opened #88 to track this discussion.

Labels: release/backwards-compatible, release/behaviour-change (this PR will change controller behaviour when released)

Successfully merging this pull request may close these issues: Implement updated reclaim policy strategy

4 participants