
Ability to replay deletion with Usage spec #5394

Merged
2 commits merged into crossplane:master on Mar 9, 2024

Conversation

@turkenh (Member) commented Feb 15, 2024

Description of your changes

This PR adds a new spec field to the Usage resource, namely replayDeletion, which would enable replaying a blocked deletion attempt right after the Usage is gone.

When a deletion is blocked, the webhook adds an annotation to the used resource recording the attempt.
When the Usage itself is being deleted, if replayDeletion is set to true, the controller triggers a delete request on the used object if that annotation is present.
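For illustration, a Usage manifest opting into this behavior might look like the following sketch. The resource names and the `of`/`by` references here are hypothetical examples, not taken from the PR; only the `replayDeletion` field is what this change introduces:

```yaml
apiVersion: apiextensions.crossplane.io/v1alpha1
kind: Usage
metadata:
  name: release-uses-cluster   # hypothetical example name
spec:
  # Replay the blocked deletion of the used resource once this Usage is gone.
  replayDeletion: true
  of:
    apiVersion: eks.upbound.io/v1beta1
    kind: Cluster
    resourceRef:
      name: my-cluster
  by:
    apiVersion: helm.crossplane.io/v1beta1
    kind: Release
    resourceRef:
      name: my-prometheus-chart
```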

A possible fix for #5193


```go
// We do the deletion async and after some delay to make sure the usage is deleted before the
// deletion attempt. We remove the finalizer on this Usage right below, so, we know it will disappear
// very soon.
time.Sleep(2 * time.Second)
```
@turkenh (Member, Author):

This is the piece I like least (which is why the PR is a draft).
But without doing it async with some delay, it becomes a chicken-and-egg problem: as long as the Usage exists, deletions will be blocked. /cc @bobh66

@bobh66 (Contributor):

I agree it's not ideal, but this is the same solution I ended up using for the Python implementation, which has been running for over a year in our deployments.

Signed-off-by: Hasan Turken <turkenh@gmail.com>
(cherry picked from commit b0dad402cbb5f939e068d3fa49c6a2260c1ed57d)
@haarchri (Contributor) commented:

Tested this implementation; in my case, deletion with a Usage is 56 minutes faster.

@turkenh (Member, Author) commented Feb 29, 2024

I fixed unit tests and marked the PR as ready for review! /cc @bobh66 @negz

I've also built and pushed an image with the latest state, which can be installed with:

```shell
helm repo add crossplane-master https://charts.crossplane.io/master --force-update
helm upgrade --install crossplane --namespace crossplane-system crossplane-master/crossplane \
  --devel --create-namespace \
  --set image.repository=index.docker.io/turkenh/crossplane \
  --set image.tag=v1.16.0-rc.0.53.gb3a366f2 \
  --set "args={--debug,--enable-usages}"
```

I'd appreciate some help with testing various cases, with or without replayDeletion: true

@bobh66 (Contributor) left a review:

LGTM - just one question inline.

```go
// @@ -68,6 +68,9 @@ type UsageSpec struct {
	// Reason is the reason for blocking deletion of the resource.
	// +optional
	Reason *string `json:"reason,omitempty"`
	// ReplayDeletion will trigger a deletion on the used resource during the
	// deletion of the usage itself, if it was attempted to be deleted at least once.
	// +optional
	ReplayDeletion *bool `json:"replayDeletion,omitempty"`
```
A contributor commented:
Would it make sense to default this to true? That would provide the best user experience. I don't know why someone wouldn't want this to happen, but it's good to have the option to disable it.

@turkenh (Member, Author) replied on Mar 9, 2024:

I would prefer to start with false and consider enabling it by default while bumping the API to beta, where we will have more freedom to make behavioral changes.

@turkenh (Member, Author):

Added as a to-do item to the promote-to-beta ticket: #4622

```go
// very soon.
time.Sleep(2 * time.Second)
// We cannot use the context from the reconcile function since it will be cancelled after the reconciliation.
_ = r.client.Delete(context.Background(), used)
```
@ravilr (Contributor):
Does this have to consider passing the right DeleteOptions for a Foreground propagationPolicy?

What if the used resource is from a different composition than the Usage resource and the by/using resource, and happens to be an XR composite from a claim with a Foreground compositeDeletionPolicy? In that case, letting the claim reconciler retry the delete with the appropriate DeleteOptions seems to be the right thing?

@bobh66 (Contributor):

> Does this have to consider passing the right DeleteOptions for a Foreground propagationPolicy?

That's a good point; it would make sense to use the same DeleteOptions. In my Python implementation of this feature I was using Foreground deletion here by default (buried in a method's default parameter, so I wasn't thinking about it).

Theoretically we're just trying to wake up the GC process early; since it has already marked things for deletion and set finalizers, I don't think the propagation policy matters here, but it is possible that it could make a difference. Probably better to set it than not. I think we could check for a foregroundDeletion finalizer on the resource and, if it exists, retry with the Foreground propagation policy; otherwise let it default to Background.

I don't know if retrying with the Claim reconciler will affect the lower-level resources that GC is trying to delete. The delete call on the Claim would need to trigger GC on the lower-level resources, and I don't know whether that happens.

@ravilr (Contributor):

> Theoretically we're just trying to wake up the GC process early, since it has already marked things for deletion and set finalizers I don't think the propagation policy matters here, but it is possible that it could make a difference. Probably better to set it than not. I think we could check for a foregroundDeletion finalizer on the resource and if it exists then retry with Foreground propagation policy otherwise let it default to Background.

Did you mean checking for the foregroundDeletion finalizer on the parent of the used resource (via its ownerReferences)? The used resource itself wouldn't have metadata.deletionTimestamp or the foregroundDeletion finalizer set on it at this stage, since the presence of the Usage would have rejected the previous deletion attempts.

> I don't know if retrying with the Claim reconciler will affect the lower-level resources that GC is trying to delete. The delete call on the Claim would need to trigger GC on the lower-level resources and I don't know if that happens or not.

If the used/of resource referred to in the Usage happens to be a nested XR or MR composed as part of a composite (a child resource with metadata.ownerReferences), then only the K8s GC controller could be the one attempting the delete, right?

If the used/of resource referred to in the Usage happens to be a parent XR (without any metadata.ownerReferences set) from a different composition than the using/Usage itself, the deletion flow would be:
an external client issues a delete on the Claim with a Foreground compositeDeletionPolicy -> the claim reconciler attempts to delete the XR with the foreground propagation policy -> the delete fails due to the Usage -> the claim reconciler keeps attempting the delete on the XR with requeues. There are no GC-controller-invoked deletes on such a parent XR in this scenario.

@turkenh (Member, Author):

Good call @ravilr 👍
I updated the PR to replay with the same propagation policy.
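One way to carry the propagation policy across to the replay is for the webhook to record it when it blocks the delete, and for the controller to read it back. The annotation key and helper below are hypothetical illustrations of that idea, not the actual keys or code from the PR:

```go
package main

import "fmt"

// Hypothetical annotation key; the key actually used by the PR may differ.
const annoBlockedPolicy = "usage.crossplane.io/deletion-attempt-with-policy"

// replayPolicy returns the propagation policy to use when replaying a
// blocked deletion: the one recorded by the webhook at block time if
// present, falling back to "Background" otherwise.
func replayPolicy(annotations map[string]string) string {
	if p, ok := annotations[annoBlockedPolicy]; ok && p != "" {
		return p // e.g. "Foreground", as recorded when the delete was blocked
	}
	return "Background"
}

func main() {
	used := map[string]string{annoBlockedPolicy: "Foreground"}
	fmt.Println(replayPolicy(used)) // prints "Foreground"
	fmt.Println(replayPolicy(nil))  // prints "Background"
}
```

In the real controller the returned value would be passed to the delete call as a DeleteOptions propagation policy rather than printed.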

Signed-off-by: Hasan Turken <turkenh@gmail.com>
@turkenh merged commit 77bddd1 into crossplane:master on Mar 9, 2024 (17 checks passed)

github-actions bot commented Mar 9, 2024

Backport failed for release-1.14, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

```shell
git fetch origin release-1.14
git worktree add -d .worktree/backport-5394-to-release-1.14 origin/release-1.14
cd .worktree/backport-5394-to-release-1.14
git switch --create backport-5394-to-release-1.14
git cherry-pick -x b3a366f2036abade23e5af30cb2e3e9f4934c3e3 ef512a0950ad0f4adc8ab4b0a5965020ae4385c7
```


github-actions bot commented Mar 9, 2024

Backport failed for release-1.15, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

```shell
git fetch origin release-1.15
git worktree add -d .worktree/backport-5394-to-release-1.15 origin/release-1.15
cd .worktree/backport-5394-to-release-1.15
git switch --create backport-5394-to-release-1.15
git cherry-pick -x b3a366f2036abade23e5af30cb2e3e9f4934c3e3 ef512a0950ad0f4adc8ab4b0a5965020ae4385c7
```
