
Increases generic worker actuator reliability #2459

Conversation

@danielfoehrKn (Contributor) commented Jun 16, 2020

How to categorize this PR?

/area usability
/kind enhancement
/area quality
/area robustness
/kind bug
/priority normal

What this PR does / why we need it:
Increases the reliability with which the generic worker actuator detects rolling updates and surfaces them as a condition in the status of the Worker CRD.

This condition is set to True as long as a rolling update is in progress (see the sketch after this list), i.e.

  • until all updated machines have joined the cluster (numUnavailable == 0), and
  • until the old machine is deleted in the case of a rolling update with maxUnavailability = 0 (numUpdated == numberOfAwakeMachines).
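
A minimal sketch of how these two criteria combine into a single completion check, using the counter names discussed in this PR (the function itself is illustrative, not the PR's exact code):

```go
package worker

// rollingUpdateDone sketches the completion check described above. The RollingUpdate
// condition would stay True until this returns true for all wanted machine deployments.
func rollingUpdateDone(numUnavailable, numUpdated, numberOfAwakeMachines int32, numHealthyDeployments, numWantedDeployments int) bool {
	return numUnavailable == 0 && // every updated machine has joined the cluster
		numUpdated == numberOfAwakeMachines && // old machines are gone, even with maxUnavailability = 0
		numHealthyDeployments == numWantedDeployments // every wanted machine deployment is healthy
}
```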

Adds checks to verify that the required machine sets have been created by the MCM. This should prevent stale cache issues.
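
A sketch of such a check, assuming the MCM sets owner references from MachineDeployments to the MachineSets it creates (types from the machine-controller-manager API; the helper name is illustrative):

```go
package worker

import (
	machinev1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// machineSetsCreated reports whether every wanted MachineDeployment already owns at least
// one MachineSet, i.e. whether the MCM has caught up before the rolling-update status is evaluated.
func machineSetsCreated(machineSets []machinev1alpha1.MachineSet, wantedDeployments []string) bool {
	owned := make(map[string]bool, len(machineSets))
	for _, ms := range machineSets {
		for _, ref := range ms.OwnerReferences {
			if ref.Kind == "MachineDeployment" {
				owned[ref.Name] = true
			}
		}
	}
	for _, name := range wantedDeployments {
		if !owned[name] {
			return false
		}
	}
	return true
}
```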

Restart stuck Machine controller manager

Restarts the machine controller manager if it failed to create the expected machine sets and the wait for the machine deployments to become ready timed out.
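
One way to implement such a restart is to delete the machine-controller-manager pods in the shoot's seed namespace so that their Deployment recreates them. This is only a sketch; the label selector and namespace handling are assumptions, not necessarily what the PR does:

```go
package worker

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// restartMachineControllerManager deletes the MCM pods so the Deployment brings up fresh ones.
// The "app: machine-controller-manager" label is an assumption for illustration.
func restartMachineControllerManager(ctx context.Context, c client.Client, namespace string) error {
	return c.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace(namespace),
		client.MatchingLabels{"app": "machine-controller-manager"},
	)
}
```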

Bugfix

Also fixes a stale cache bug that causes the actuator not to wait for the rolling update to complete: the check whether the rolling update is complete is sometimes performed against an outdated machine deployment.
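
A sketch of the general pattern for avoiding such staleness: read the MachineDeployment through a non-cached reader (e.g. the manager's API reader) instead of the cached client when deciding whether the rolling update has finished. Names here are illustrative, not the PR's exact code:

```go
package worker

import (
	"context"

	machinev1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getMachineDeploymentDirectly fetches the MachineDeployment straight from the API server
// (apiReader must be a non-cached client.Reader), so the completion check does not run
// against a stale cached object.
func getMachineDeploymentDirectly(ctx context.Context, apiReader client.Reader, namespace, name string) (*machinev1alpha1.MachineDeployment, error) {
	machineDeployment := &machinev1alpha1.MachineDeployment{}
	if err := apiReader.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, machineDeployment); err != nil {
		return nil, err
	}
	return machineDeployment, nil
}
```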

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
The topic of this PR is increasing the reliability of the generic worker actuator.
Because the introduced features share code and are thematically coherent, I propose keeping them in the same PR. If that is not feasible, let me know.

Please note that the cluster autoscaler scale-down behaviour has not changed (it is still scaled down during creation and rolling updates).
In the future, scaling down the cluster autoscaler might no longer be required.
See this issue and the corresponding issue on the MCM.

Release note:

Introduces a `RollingUpdate` condition (condition.Type `RollingUpdate`) in the generic worker actuator. Gardener provider extensions write this condition to the Worker CRD.
The generic worker actuator waits more reliably for rolling updates to finish: it waits until all updated machines have joined the cluster and until the old machines are deleted. It also fixes a stale cache bug that previously caused the actuator not to wait for the rolling update to complete.
The generic worker actuator detects and restarts 'stuck' machine controller manager pods.

@danielfoehrKn danielfoehrKn requested a review from a team as a code owner June 16, 2020 09:16
@gardener-robot gardener-robot added the labels area/usability, kind/bug, kind/enhancement, priority/normal on Jun 16, 2020
@gardener-robot

@danielfoehrKn Thank you for your contribution.

@danielfoehrKn danielfoehrKn force-pushed the enhancement/worker-rolling-update-info branch 2 times, most recently from 21d0aed to c786d55 Compare June 16, 2020 09:38
if numUpdated >= numDesired && int(numHealthyDeployments) == len(wantedMachineDeployments) {
// numUpdated == numberOfAwakeMachines waits until the old machine is deleted in the case of a rolling update with maxUnavailability = 0
// numUnavailable == 0 makes sure that every machine joined the cluster (during creation & in the case of a rolling update with maxUnavailability > 0)
if numUnavailable == 0 && numUpdated == numberOfAwakeMachines && int(numHealthyDeployments) == len(wantedMachineDeployments) {


There still could be one condition that hasn't been captured: terminating machines from the old machineSets (e.g. due to larger drain timeouts and PDBs). I think it might be okay to ignore this if you think it's not too important, as they are only terminating machines and the new machines are already created.

To capture something like that, we might need to check all machines belonging to the old machineSets and make sure their sum is 0.
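
A rough sketch of that idea (not part of the PR; assumes the MachineSet status carries ReplicaSet-style replica counters that include terminating machines):

```go
package worker

import (
	machinev1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// oldMachineSetsDrained returns true once the old MachineSets no longer report any machines,
// which would also cover machines still terminating due to long drain timeouts or PDBs.
func oldMachineSetsDrained(oldMachineSets []machinev1alpha1.MachineSet) bool {
	var remaining int32
	for _, ms := range oldMachineSets {
		remaining += ms.Status.Replicas
	}
	return remaining == 0
}
```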

Member


Hm, why was this changed? IIUC it's not related to the RollingUpdate condition. Is this something that we have to improve, i.e., was the previous check not good enough? cc @prashanth26

			if numUpdated >= numDesired && int(numHealthyDeployments) == len(wantedMachineDeployments) {	

Contributor Author


This changes the logic that waits for the rolling update to finish.
Currently, when triggering a rolling update (e.g. a volume resize) with maxUnavailability = 0, it does not wait for the new machine to join the cluster. This is a bug in my opinion (see the release notes).

In certain cases it did not wait for the rolling update to complete, so the RollingUpdate condition was removed too early.

Contributor Author


@rfranzke you are right, it is not strictly related to putting a condition into the Worker status. But there is little use for such a condition if the check for a rolling update itself does not work as expected.


Yes, I think having numUnavailable == 0 should improve capturing this rolling update for longer.


I think the scenarios are exhaustive enough for now. Looks good to me.

Contributor Author


I know it is quite some work, but I would be glad if someone would test those scenarios again (and possibly other scenarios that I did not think of).
I'll do it again once all comments are resolved.


Hi @danielfoehrKn,

I can try to do it. However, I might be OOO until Tuesday and would only be able to test the changes then.


I tested some scenarios; it seems to be working well. Looks good to me.

Contributor Author


Thanks a lot for testing!

@rfranzke (Member) left a comment


Generally, the PR looks OK. Earlier, I was thinking about a separate controller that watches MachineDeployments/MachineSets and then decides whether a rolling update is ongoing. But as a rolling update can/should only be triggered by the reconciliation loop, it should also be fine to put it there.


@danielfoehrKn danielfoehrKn marked this pull request as draft June 17, 2020 09:14
@danielfoehrKn danielfoehrKn force-pushed the enhancement/worker-rolling-update-info branch from c786d55 to 808ef9f Compare June 17, 2020 10:09
@danielfoehrKn danielfoehrKn force-pushed the enhancement/worker-rolling-update-info branch 7 times, most recently from 89504ac to a12bfa9 Compare June 17, 2020 14:04
@danielfoehrKn danielfoehrKn force-pushed the enhancement/worker-rolling-update-info branch from a12bfa9 to fdaf9a3 Compare June 17, 2020 14:47
@timebertt
Member

/hold until v1.7 is released

@prashanth26

Hi guys,

I am running a few tests to check this PR today. Please hold merging until then.

Thanks.

@ialidzhikov
Member

@danielfoehrKn , it looks like concourse-ci/verify failed because of a flaky test which was recently fixed in the master branch with #2532. If you want, you could rebase to get this change.

@danielfoehrKn
Contributor Author

@ialidzhikov I am expecting more comments from @prashanth26, who is currently testing. Then I'll update the PR.

@prashanth26

prashanth26 commented Jul 1, 2020

I just tested out the PR for some shoot/worker-pool creation/update/deletion scenarios. It seems to be working well. Nice job with the PR. However, please address comments by other reviewers.

/lgtm

prashanth26 previously approved these changes Jul 1, 2020
@prashanth26

prashanth26 commented Jul 1, 2020

Also I see a hold until v1.7 from @tim-ebert.

/hold

@rfranzke
Member

rfranzke commented Jul 3, 2020

/unhold as v1.7.0 is released

@danielfoehrKn
Contributor Author

@hardikdr @prashanth26 any additional comments? Otherwise I would like to get this PR in soon to have it thoroughly tested and included in the next release.


@prashanth26 prashanth26 left a comment


/lgtm

Labels
area/quality, area/robustness, area/usability, kind/bug, kind/enhancement