
Increases generic worker actuator reliability #2459

Conversation

@danielfoehrKn (Contributor) commented Jun 16, 2020

How to categorize this PR?

/area usability
/kind enhancement
/area quality
/area robustness
/kind bug
/priority normal

What this PR does / why we need it:
Increases the reliability with which the generic worker actuator detects rolling updates and surfaces them as a condition in the status of the Worker CRD.

This condition is set to True as long as a rolling update is in progress (see the sketch after this list), i.e.

  • until all updated machines have joined the cluster (numUnavailable == 0), and
  • until the old machine is deleted in the case of a rolling update with maxUnavailability = 0 (numUpdated == numberOfAwakeMachines).
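
A minimal sketch of how these two criteria combine into a single completion check, using the counter names discussed in this PR (the function itself is illustrative, not the PR's exact code):

```go
package worker

// rollingUpdateDone sketches the completion check described above. The RollingUpdate
// condition would stay True until this returns true for all wanted machine deployments.
func rollingUpdateDone(numUnavailable, numUpdated, numberOfAwakeMachines int32, numHealthyDeployments, numWantedDeployments int) bool {
	return numUnavailable == 0 && // every updated machine has joined the cluster
		numUpdated == numberOfAwakeMachines && // old machines are gone, even with maxUnavailability = 0
		numHealthyDeployments == numWantedDeployments // every wanted machine deployment is healthy
}
```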

Adds checks to verify that the required machine sets have been created by the MCM. This should prevent stale cache issues.
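
A sketch of such a check, assuming the MCM sets owner references from MachineDeployments to the MachineSets it creates (types from the machine-controller-manager API; the helper name is illustrative):

```go
package worker

import (
	machinev1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// machineSetsCreated reports whether every wanted MachineDeployment already owns at least
// one MachineSet, i.e. whether the MCM has caught up before the rolling-update status is evaluated.
func machineSetsCreated(machineSets []machinev1alpha1.MachineSet, wantedDeployments []string) bool {
	owned := make(map[string]bool, len(machineSets))
	for _, ms := range machineSets {
		for _, ref := range ms.OwnerReferences {
			if ref.Kind == "MachineDeployment" {
				owned[ref.Name] = true
			}
		}
	}
	for _, name := range wantedDeployments {
		if !owned[name] {
			return false
		}
	}
	return true
}
```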

Restart stuck Machine controller manager

Restarts the machine controller manager if it failed to create the expected machine sets and the wait for the machine deployments to become ready timed out.
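
One way to implement such a restart is to delete the machine-controller-manager pods in the shoot's seed namespace so that their Deployment recreates them. This is only a sketch; the label selector and namespace handling are assumptions, not necessarily what the PR does:

```go
package worker

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// restartMachineControllerManager deletes the MCM pods so the Deployment brings up fresh ones.
// The "app: machine-controller-manager" label is an assumption for illustration.
func restartMachineControllerManager(ctx context.Context, c client.Client, namespace string) error {
	return c.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace(namespace),
		client.MatchingLabels{"app": "machine-controller-manager"},
	)
}
```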

Bugfix

Also fixes a stale cache bug that causes the actuator not to wait for the rolling update to complete: the check whether the rolling update is complete is sometimes performed against an outdated machine deployment.
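
A sketch of the general pattern for avoiding such staleness: read the MachineDeployment through a non-cached reader (e.g. the manager's API reader) instead of the cached client when deciding whether the rolling update has finished. Names here are illustrative, not the PR's exact code:

```go
package worker

import (
	"context"

	machinev1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getMachineDeploymentDirectly fetches the MachineDeployment straight from the API server
// (apiReader must be a non-cached client.Reader), so the completion check does not run
// against a stale cached object.
func getMachineDeploymentDirectly(ctx context.Context, apiReader client.Reader, namespace, name string) (*machinev1alpha1.MachineDeployment, error) {
	machineDeployment := &machinev1alpha1.MachineDeployment{}
	if err := apiReader.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, machineDeployment); err != nil {
		return nil, err
	}
	return machineDeployment, nil
}
```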

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
The topic of this PR is increasing the reliability of the generic worker actuator.
Because the introduced features share code and are thematically coherent, I propose keeping them in the same PR. If that is not feasible, let me know.

Please note that the cluster autoscaler scale-down behaviour has not changed (it is still scaled down during creation and rolling updates).
In the future, scaling down the cluster autoscaler might no longer be required.
See this issue and the corresponding issue on the MCM.

Release note:

Introduces a `RollingUpdate` condition (condition.Type `RollingUpdate`) in the generic worker actuator. Gardener provider extensions write this condition to the Worker CRD.
The generic worker actuator waits more reliably for rolling updates to finish: it waits until all updated machines have joined the cluster and until the old machines are deleted. It also fixes a stale cache bug that previously caused the actuator not to wait for the rolling update to complete.
The generic worker actuator detects and restarts 'stuck' machine controller manager pods.

@danielfoehrKn danielfoehrKn requested a review from a team as a code owner June 16, 2020 09:16
@gardener-robot gardener-robot added the labels area/usability, kind/bug, kind/enhancement, priority/normal on Jun 16, 2020
@gardener-robot

@danielfoehrKn Thank you for your contribution.

@danielfoehrKn danielfoehrKn force-pushed the enhancement/worker-rolling-update-info branch 2 times, most recently from 21d0aed to c786d55 Compare June 16, 2020 09:38
if numUpdated >= numDesired && int(numHealthyDeployments) == len(wantedMachineDeployments) {
// numUpdated == numberOfAwakeMachines waits until the old machine is deleted in the case of a rolling update with maxUnavailability = 0
// numUnavailable == 0 makes sure that every machine joined the cluster (during creation & in the case of a rolling update with maxUnavailability > 0)
if numUnavailable == 0 && numUpdated == numberOfAwakeMachines && int(numHealthyDeployments) == len(wantedMachineDeployments) {


There still could be one condition that hasn't been captured: terminating machines from the old machineSets (e.g. due to larger drain timeouts and PDBs). I think it might be okay to ignore this if you think it's not too important, as they are only terminating machines and the new machines are already created.

To capture something like that, we might need to check all machines belonging to the old machineSets and make sure their sum is 0.
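
A rough sketch of that idea (not part of the PR; assumes the MachineSet status carries ReplicaSet-style replica counters that include terminating machines):

```go
package worker

import (
	machinev1alpha1 "github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// oldMachineSetsDrained returns true once the old MachineSets no longer report any machines,
// which would also cover machines still terminating due to long drain timeouts or PDBs.
func oldMachineSetsDrained(oldMachineSets []machinev1alpha1.MachineSet) bool {
	var remaining int32
	for _, ms := range oldMachineSets {
		remaining += ms.Status.Replicas
	}
	return remaining == 0
}
```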

Member


Hm, why was this changed? IIUC it's not related to the RollingUpdate condition. Is this something that we have to improve, i.e., was the previous check not good enough? cc @prashanth26

			if numUpdated >= numDesired && int(numHealthyDeployments) == len(wantedMachineDeployments) {	

Contributor Author


This changes the logic that waits for the rolling update to finish.
Currently, when triggering a rolling update (e.g. a volume resize) with maxUnavailability = 0, it does not wait for the new machine to join the cluster. This is a bug in my opinion (see the release notes).

In certain cases it did not wait for the rolling update to complete, so the RollingUpdate condition was removed too early.

Contributor Author


@rfranzke you are right, it is not strictly related to putting a condition into the Worker status. But there is little use for such a condition if the check for a rolling update itself does not work as expected.


Yes, I think having numUnavailable == 0 should improve capturing this rolling update for longer.


I think the scenarios are exhaustive enough for now. Looks good to me.

Contributor Author


I know it is quite some work, but I would be glad if someone would test those scenarios again (and possibly other scenarios that I did not think of).
I'll do it again once all comments are resolved.


Hi @danielfoehrKn,

I can try to do it. However, I might be OOO until Tuesday and would only be able to test the changes then.


I tested some scenarios; it seems to be working well. Looks good to me.

Contributor Author


Thanks a lot for testing!

@rfranzke (Member) left a comment


Generally, the PR looks OK. Earlier, I was thinking about a separate controller that watches MachineDeployments/MachineSets and then decides whether a rolling update is ongoing. But as a rolling update can/should only be triggered by the reconciliation loop, it should also be fine to put it there.


@danielfoehrKn danielfoehrKn marked this pull request as draft June 17, 2020 09:14
@danielfoehrKn danielfoehrKn force-pushed the enhancement/worker-rolling-update-info branch from c786d55 to 808ef9f Compare June 17, 2020 10:09
@danielfoehrKn danielfoehrKn force-pushed the enhancement/worker-rolling-update-info branch 7 times, most recently from 89504ac to a12bfa9 Compare June 17, 2020 14:04
@danielfoehrKn danielfoehrKn force-pushed the enhancement/worker-rolling-update-info branch from a12bfa9 to fdaf9a3 Compare June 17, 2020 14:47
@timebertt
Member

/hold until v1.7 is released

@prashanth26

Hi guys,

I am running a few tests to check this PR today. Please hold merging until then.

Thanks.

@ialidzhikov
Member

@danielfoehrKn , it looks like concourse-ci/verify failed because of a flaky test which was recently fixed in the master branch with #2532. If you want, you could rebase to get this change.

@danielfoehrKn
Contributor Author

@ialidzhikov I am expecting more comments from @prashanth26, who is currently testing. Then I'll update the PR.

@prashanth26

prashanth26 commented Jul 1, 2020

I just tested out the PR for some shoot/worker-pool creation/update/deletion scenarios. It seems to be working well. Nice job with the PR. However, please address comments by other reviewers.

/lgtm

prashanth26 previously approved these changes Jul 1, 2020
@prashanth26

prashanth26 commented Jul 1, 2020

Also I see a hold until v1.7 from @tim-ebert.

/hold

@rfranzke
Member

rfranzke commented Jul 3, 2020

/unhold as v1.7.0 is released

@danielfoehrKn
Contributor Author

@hardikdr @prashanth26 any additional comments? Otherwise I would like to get this PR in soon to have it thoroughly tested and included in the next release.


@prashanth26 prashanth26 left a comment


/lgtm

Labels
area/quality, area/robustness, area/usability, kind/bug, kind/enhancement