Increases generic worker actuator reliability #2459
Conversation
@danielfoehrKn Thank you for your contribution.
Force-pushed from 21d0aed to c786d55
- if numUpdated >= numDesired && int(numHealthyDeployments) == len(wantedMachineDeployments) {
+ // numUpdated == numberOfAwakeMachines waits until the old machine is deleted in the case of a rolling update with maxUnavailability = 0
+ // numUnavailable == 0 makes sure that every machine joined the cluster (during creation & in the case of a rolling update with maxUnavailability > 0)
+ if numUnavailable == 0 && numUpdated == numberOfAwakeMachines && int(numHealthyDeployments) == len(wantedMachineDeployments) {
There still could be one condition that hasn't been captured: terminating machines from the old machine sets (e.g. due to larger drain timeouts and PDBs). I think it might be okay to ignore this if you think it's not too important, as they are only terminating machines and the new machines are already created.
To capture something like that, we might need to check all machines belonging to the old machine set and make sure their sum is 0.
Hm, why was this changed? IIUC it's not related to the RollingUpdate condition. Is this something that we have to improve, i.e., was the previous check not good enough? cc @prashanth26
if numUpdated >= numDesired && int(numHealthyDeployments) == len(wantedMachineDeployments) {
This changes the logic that waits for the rolling update to finish.
Currently, when triggering a rolling update (e.g. a volume resize) with maxUnavailability = 0, it does not wait for the new machine to join the cluster. This is a bug in my opinion (see the release notes).
In certain cases it did not wait for the rolling update to complete, so the RollingUpdate condition was removed too early.
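For clarity, here is a minimal sketch (not the actual actuator code) of the completeness check this change introduces; the `machineDeploymentStatus` struct and its field names are stand-ins for the aggregated MCM machine deployment status values:

```go
// Sketch only: aggregate the status of all wanted machine deployments and
// decide whether the rolling update (or initial creation) has finished.
type machineDeploymentStatus struct {
	UpdatedReplicas     int32 // machines already running the new machine class
	UnavailableReplicas int32 // machines that have not yet joined the cluster
	Healthy             bool  // simplified per-deployment health signal
}

func rollingUpdateDone(statuses []machineDeploymentStatus, numberOfAwakeMachines int32) bool {
	var numUpdated, numUnavailable, numHealthyDeployments int32
	for _, s := range statuses {
		numUpdated += s.UpdatedReplicas
		numUnavailable += s.UnavailableReplicas
		if s.Healthy {
			numHealthyDeployments++
		}
	}
	// numUnavailable == 0: every machine has joined the cluster (creation and
	// rolling updates with maxUnavailability > 0).
	// numUpdated == numberOfAwakeMachines: the old machines have been deleted
	// (rolling updates with maxUnavailability = 0).
	return numUnavailable == 0 &&
		numUpdated == numberOfAwakeMachines &&
		int(numHealthyDeployments) == len(statuses)
}
```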
@rfranzke you are right, it is not strictly related to putting a condition into the Worker status. But there is little use for such a condition if the check for a rolling update itself does not work as expected.
Yes, I think having numUnavailable == 0 should improve capturing this rolling update for longer.
I think the scenarios are exhaustive enough for now. Looks good to me.
I know it is quite some work, but I would be glad if someone would test those (and possibly other scenarios that I did not think of) again.
I'll do it again once all comments are resolved.
Hi @danielfoehrKn,
I can try to do it. However, I might be OOO until Tuesday and would only be able to test the changes then.
I tested some scenarios, seems to be working well. Looks good to me.
Thanks a lot for testing!
Generally, the PR looks OK. Earlier, I was thinking about a separate controller that watches MachineDeployments/MachineSets and then decides whether a rolling update is ongoing. But as a rolling update can/should only be triggered by the reconciliation loop, it actually should also be fine to put it there.
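For illustration only, a rough sketch of that alternative with controller-runtime (watching MachineDeployments via a dedicated reconciler); the GVK, the assumption of a recent controller-runtime version, and the reconciler body are not part of this PR:

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Sketch of a separate controller that watches MachineDeployments and would
// derive from their status whether a rolling update is ongoing.
func addRollingUpdateWatcher(mgr ctrl.Manager) error {
	md := &unstructured.Unstructured{}
	md.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "machine.sapcloud.io", // assumed MCM API group
		Version: "v1alpha1",
		Kind:    "MachineDeployment",
	})
	return ctrl.NewControllerManagedBy(mgr).
		For(md).
		Complete(reconcile.Func(func(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
			// ...fetch the MachineDeployment, inspect its status, and update
			// the Worker condition accordingly (omitted in this sketch)...
			return reconcile.Result{}, nil
		}))
}
```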
Force-pushed from c786d55 to 808ef9f
Force-pushed from 89504ac to a12bfa9
Force-pushed from a12bfa9 to fdaf9a3
Force-pushed from 3ffdc20 to d952900
/hold until
Hi guys, I am running a few tests to check this PR today. Please hold merging until then. Thanks.
@danielfoehrKn, it looks like
@ialidzhikov I am expecting more comments from @prashanth26, who is currently testing. Then I'll update the PR.
I just tested out the PR for some shoot/worker-pool creation/update/deletion scenarios. It seems to be working well. Nice job with the PR. However, please address comments by other reviewers. /lgtm
Also I see a /hold
Force-pushed from d952900 to 3c39cff
/unhold as v1.7.0 is released
@hardikdr @prashanth26 any additional comments? Otherwise I would like to get this PR in soon to have it thoroughly tested and included in the next release.
/lgtm
How to categorize this PR?
/area usability
/kind enhancement
/area quality
/area robustness
/kind bug
/priority normal
What this PR does / why we need it:
Increases the reliability of the generic worker actuator to detect rolling updates and reports an ongoing rolling update as a condition in the status of the Worker CRD.
This condition is set to True as long as there is a rolling update in progress, i.e. until
- every machine has joined the cluster (numUnavailable == 0)
- the old machines have been deleted (numUpdated == numberOfAwakeMachines)

Adds checks to verify that the required machine sets have been created by the MCM. This should prevent stale cache issues.
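For illustration, the condition could be constructed roughly like this. This is a sketch only: the Condition type is assumed to be the Gardener core condition type used for Worker status conditions, and the reason and message strings are made up:

```go
import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// rollingUpdateCondition is an illustrative example of the RollingUpdate
// condition that would be reported in the Worker status while machines are
// being rolled; reason and message are hypothetical.
var rollingUpdateCondition = gardencorev1beta1.Condition{
	Type:               "RollingUpdate",
	Status:             gardencorev1beta1.ConditionTrue,
	Reason:             "RollingUpdateInProgress",
	Message:            "at least one machine deployment is being rolled",
	LastTransitionTime: metav1.Now(),
	LastUpdateTime:     metav1.Now(),
}
```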
Restart stuck Machine controller manager
Restarts the machine controller manager if it failed to create the proper machine sets after unsuccessfully waiting for the machine deployments to become ready.
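A minimal sketch of what such a restart could look like with a controller-runtime client; the namespace handling and the pod label selector are assumptions for illustration, not necessarily what the actuator does:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// restartMachineControllerManager deletes the machine-controller-manager pods
// so that its Deployment recreates them, effectively restarting a stuck MCM.
func restartMachineControllerManager(ctx context.Context, c client.Client, namespace string) error {
	return c.DeleteAllOf(ctx, &corev1.Pod{},
		client.InNamespace(namespace),
		client.MatchingLabels{"app": "machine-controller-manager"}, // assumed label
	)
}
```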
Bugfix
Also fixes a stale cache bug that can lead to not waiting for the rolling update to complete: the check whether the rolling update is complete is sometimes performed on an outdated machine deployment.
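A sketch of how such a stale-cache read can be avoided with controller-runtime, by going through the manager's uncached API reader instead of the cached client; the GVK and function name are illustrative:

```go
import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getMachineDeploymentFresh reads a MachineDeployment directly from the API
// server (bypassing the informer cache), so that the rolling-update check is
// not performed on an outdated object. apiReader is typically mgr.GetAPIReader().
func getMachineDeploymentFresh(ctx context.Context, apiReader client.Reader, key client.ObjectKey) (*unstructured.Unstructured, error) {
	md := &unstructured.Unstructured{}
	md.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "machine.sapcloud.io", // assumed MCM API group
		Version: "v1alpha1",
		Kind:    "MachineDeployment",
	})
	if err := apiReader.Get(ctx, key, md); err != nil {
		return nil, err
	}
	return md, nil
}
```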
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
The topic of this PR is increasing the reliability of the generic worker actuator.
Because the introduced features share code and are thematically coherent, I propose having them in the same PR. If that is not feasible, let me know.
Please note that the cluster autoscaler scale-down behaviour has not changed (it is still scaled down during creation and rolling updates).
In the future, the cluster autoscaler might no longer need to be scaled down.
See this issue and the corresponding issue on the MCM.
Release note: