
Rearranging control plane migration task order to ensure that all objects are deleted #2636

Merged
merged 1 commit into gardener:master from cp-migration/fix-m-flow
Aug 4, 2020

Conversation

krgostev
Contributor

How to categorize this PR?

/area control-plane-migration
/kind enhancement
/priority normal

What this PR does / why we need it:
When the control plane migration is executed, there is a chance that some objects are not deleted because of the current task arrangement and the dependencies between them. With this PR the tasks of the control plane migration flow are rearranged so that the task deleting the managed resources from the Shoot's namespace waits for the deletion of the extension CRs from the Shoot's namespace and the deletion of the BackupEntry resources from the Seed.
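For illustration, a minimal, self-contained Go sketch of the reordered dependency idea. It uses a tiny hand-rolled task runner rather than Gardener's actual flow package, and the task names (DeleteExtensionCRs, DeleteBackupEntry, DeleteManagedResources, DeleteNamespace) are only placeholders for the real migration tasks:

```go
package main

import "fmt"

type task struct {
	name string
	deps []string
	run  func() error
}

// runTasks executes the tasks in an order that respects the declared
// dependencies (no parallelism or cycle detection, just enough to show
// the ordering constraint).
func runTasks(tasks []task) error {
	done := map[string]bool{}
	for len(done) < len(tasks) {
		progressed := false
		for _, t := range tasks {
			if done[t.name] {
				continue
			}
			ready := true
			for _, d := range t.deps {
				if !done[d] {
					ready = false
					break
				}
			}
			if !ready {
				continue
			}
			if err := t.run(); err != nil {
				return fmt.Errorf("task %q failed: %w", t.name, err)
			}
			done[t.name] = true
			progressed = true
		}
		if !progressed {
			return fmt.Errorf("unsatisfiable dependencies")
		}
	}
	return nil
}

func main() {
	say := func(name string) func() error {
		return func() error { fmt.Println("running:", name); return nil }
	}
	tasks := []task{
		{name: "DeleteExtensionCRs", run: say("DeleteExtensionCRs")},
		{name: "DeleteBackupEntry", run: say("DeleteBackupEntry")},
		// The change in this PR: deleting the managed resources now waits
		// for both the extension CR deletion and the BackupEntry deletion.
		{
			name: "DeleteManagedResources",
			deps: []string{"DeleteExtensionCRs", "DeleteBackupEntry"},
			run:  say("DeleteManagedResources"),
		},
		{
			name: "DeleteNamespace",
			deps: []string{"DeleteManagedResources"},
			run:  say("DeleteNamespace"),
		},
	}
	if err := runTasks(tasks); err != nil {
		fmt.Println("error:", err)
	}
}
```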

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
The commented code in shoot.go prevents the "seedName" field in the Shoot from being edited. Should this be deleted here, or will it be included in some other PR?
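For context, a hypothetical sketch of the kind of immutability check such code would perform; the function name validateSeedNameUpdate is made up and this is not the actual content of shoot.go, but apimachinery's ValidateImmutableField is a common way to enforce immutable fields in Kubernetes-style validation code:

```go
// Hypothetical sketch of a seedName immutability check; names and wiring
// are illustrative only.
package validation

import (
	apivalidation "k8s.io/apimachinery/pkg/api/validation"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// validateSeedNameUpdate rejects updates that change spec.seedName, which is
// the effect attributed to the commented-out code mentioned above.
func validateSeedNameUpdate(newSeedName, oldSeedName *string, fldPath *field.Path) field.ErrorList {
	return apivalidation.ValidateImmutableField(newSeedName, oldSeedName, fldPath.Child("seedName"))
}
```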
Release note:

The control plane migration task order is rearranged to ensure the deletion of all objects.

@krgostev krgostev requested a review from a team as a code owner July 29, 2020 08:19
@gardener-robot gardener-robot added area/control-plane-migration Control plane migration related kind/enhancement Enhancement, improvement, extension priority/normal labels Jul 29, 2020
@gardener-robot

@kris94 Thank you for your contribution.

@CLAassistant

CLAassistant commented Jul 29, 2020

CLA assistant check
All committers have signed the CLA.

@krgostev
Contributor Author

/assign @plkokanov

@gardener-robot

@kris94 Only code owners may change assignees. Commenters may only assign themselves.

Member

@rfranzke rfranzke left a comment


pkg/apis/core/validation/shoot.go (review comments marked as outdated and resolved)
@rfranzke
Member

Thanks, please squash the commits.

@stoyanr
Contributor

stoyanr commented Jul 30, 2020

I tested the change, it works, and I also have no comments on the code itself. The only concern I have is that the description in the PR of the potential issue, "there is a chance that some objects are not deleted due to the task arrangement and the dependencies between them", is fairly vague. How could one reproduce an issue with the previous arrangement, so that we could test that the new arrangement actually fixes it? Have you actually tested this?

stoyanr
stoyanr previously approved these changes Jul 30, 2020
@plkokanov
Contributor

The problem that this should fix is the following:

  1. The managed resources are flagged with keepObjects: true.
  2. The managed resources are deleted.
  3. A reconciliation is triggered in one of the extension controllers, which deploys new managed resources (this can happen because the deletion of the extension resources occurs later in the flow, e.g. if the extension controller had an error on its previous reconciliation).
  4. This causes the namespace deletion to fail.
  5. On subsequent executions of the flow we only try to scale up the kube-apiserver and the ResourceManager if they already exist and the namespace is not in the Terminating state (which it is); if there are no deployments, we don't redeploy them. This causes the migration flow to hang because the leftover managed resource never gets deleted.

I just want to double-check the deletion of the BackupEntry before the deletion of the kube-apiserver, and whether it is OK for the flagging of the managed resources with keepObjects: true to happen after the migration of the extensions.
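To make the scenario above more concrete, a self-contained Go sketch (all types and names are illustrative, not Gardener code) of how a still-active extension controller can recreate a managed resource after the deletion step and leave the namespace deletion stuck, and why handling the extension CRs first avoids that:

```go
package main

import "fmt"

// shootNamespace models only the bits needed for the scenario: the set of
// managed resources that still exist in the Shoot's namespace in the Seed.
type shootNamespace struct {
	managedResources map[string]bool
	extensionCRsGone bool
}

func (ns *shootNamespace) deleteManagedResources() {
	ns.managedResources = map[string]bool{}
	fmt.Println("managed resources deleted")
}

// extensionReconcile emulates an extension controller that is still active
// (its CR has not been deleted yet) and redeploys its managed resource.
func (ns *shootNamespace) extensionReconcile() {
	if ns.extensionCRsGone {
		return // nothing left to reconcile
	}
	ns.managedResources["extension-mr"] = true
	fmt.Println("extension controller redeployed its managed resource")
}

func (ns *shootNamespace) deleteNamespace() error {
	if len(ns.managedResources) > 0 {
		return fmt.Errorf("namespace deletion stuck: %d managed resource(s) left over", len(ns.managedResources))
	}
	fmt.Println("namespace deleted")
	return nil
}

func main() {
	// Old ordering: managed resources are deleted while an extension
	// controller may still reconcile and recreate them.
	ns := &shootNamespace{managedResources: map[string]bool{"extension-mr": true}}
	ns.deleteManagedResources()
	ns.extensionReconcile()
	fmt.Println("old order:", ns.deleteNamespace())

	// New ordering: the extension CRs are dealt with first, so no
	// reconciliation recreates the managed resources before cleanup.
	ns = &shootNamespace{managedResources: map[string]bool{"extension-mr": true}}
	ns.extensionCRsGone = true // extension CR deletion / migration done first
	ns.extensionReconcile()
	ns.deleteManagedResources()
	fmt.Println("new order:", ns.deleteNamespace())
}
```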

@plkokanov
Contributor

plkokanov commented Jul 31, 2020

I wasn't able to reproduce the original issue, but imho the new ordering makes sense, considering that extension controllers could have special handling for the managed resources they deploy during the Migration operation. So it seems better to try to clean up all managed resources only after the extension controllers have been migrated and deleted.

However, I tried to migrate a hibernated shoot cluster and the migration seems to hang on:

level=info msg=Succeeded flow="Shoot's control plane preparation for migration" operation=reconcile shoot=garden-mig/migration1 task="Preparing kube-apiserver in Shoot's namespace for migration, by deleting it and its respective hvpa"
level=error msg="Failed to prepare Shoot \"migration1\" for migration: flow \"Shoot's control plane preparation for migration\" encountered task errors: [task \"Waiting until kube-apiserver doesn't exist\" failed: retry failed with context deadline exceeded]" operation=reconcile shoot=garden-mig/migration1

This might not be caused directly by this PR, but could you check?

@rfranzke
Member

rfranzke commented Aug 3, 2020

/status author-action

@gardener-robot

@kris94 The pull request was assigned to you under author-action. Please unassign yourself when you are done. Thank you.

@swilen-iwanow
Contributor

> I wasn't able to reproduce the original issue, but imho the new ordering makes sense, considering that extension controllers could have special handling for the managed resources they deploy during the Migration operation. So it seems better to try to clean up all managed resources only after the extension controllers have been migrated and deleted.
>
> However, I tried to migrate a hibernated shoot cluster and the migration seems to hang on:
>
> level=info msg=Succeeded flow="Shoot's control plane preparation for migration" operation=reconcile shoot=garden-mig/migration1 task="Preparing kube-apiserver in Shoot's namespace for migration, by deleting it and its respective hvpa"
> level=error msg="Failed to prepare Shoot \"migration1\" for migration: flow \"Shoot's control plane preparation for migration\" encountered task errors: [task \"Waiting until kube-apiserver doesn't exist\" failed: retry failed with context deadline exceeded]" operation=reconcile shoot=garden-mig/migration1
>
> This might not be caused directly by this PR, but could you check?

This is very strange; I will also try to reproduce it.

@swilen-iwanow
Contributor

I can confirm that the issue described by Plamen is not caused by this PR. For a couple of reasons we've added a wake-up step for the kube-apiserver in the migration flow, and this causes an issue for hibernated Shoots: the scale-down of the deployment is skipped for hibernated Shoots, so the wake-up step times out. I managed to get past the problematic step by removing the ".Skip", but we should look into whether only removing the skip is OK or whether more work is needed. Anyway, this is not part of this PR and we can fix it in a separate one.
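A minimal sketch of the interaction described above, under the assumption that a step skipped for hibernated Shoots never produces the state a later wait task polls for, so that wait eventually runs into its context deadline. All names here are illustrative, not the actual Gardener flow tasks or API:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitUntil polls cond until it returns true or the context expires,
// mimicking a "wait until ..." style task in the migration flow.
func waitUntil(ctx context.Context, cond func() bool) error {
	for {
		if cond() {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("retry failed with context deadline exceeded")
		case <-time.After(50 * time.Millisecond):
		}
	}
}

func main() {
	hibernated := true
	apiServerScaledUp := false

	// Step with a skip condition (analogous to ".Skip"): for hibernated
	// Shoots it never runs, so apiServerScaledUp never becomes true.
	if !hibernated {
		apiServerScaledUp = true
	}

	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()
	if err := waitUntil(ctx, func() bool { return apiServerScaledUp }); err != nil {
		fmt.Println("wait step failed:", err)
	}
}
```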

@rfranzke
Member

rfranzke commented Aug 3, 2020

@swilen-iwanow thanks for your check. So is this PR ready to be merged?

@plkokanov
Contributor

Just needs a rebase to include the changes for the DNSEntries and then it should be ok.

Contributor

@plkokanov plkokanov left a comment


lgtm

@rfranzke rfranzke merged commit 6244954 into gardener:master Aug 4, 2020
@gardener-robot gardener-robot added priority/3 Priority (lower number equals higher priority) and removed priority/3 Priority (lower number equals higher priority) labels Mar 8, 2021
@krgostev krgostev deleted the cp-migration/fix-m-flow branch November 23, 2021 07:43