Integrate enhanced meltdown handling for dependency-watchdog probe #5497
Conversation
Currently the e2e tests are failing as I've imported
/assign
Actually, MCM is an implementation detail of the
Please also bump the image in `charts/images.yaml` and incorporate the release notes from #5494 so that we have everything related to the change in one PR.
Force-pushed from b9d4727 to 6e52ca2
@rfranzke Should the k8s dependencies and c-r also be bumped to latest? I checked: DWD is using v0.17 of the k8s dependencies and v0.5.5 of c-r.
@acumino I think this is not related to this PR but rather to the maintenance of https://github.com/gardener/dependency-watchdog/, isn't it?
Yes, but @ashwani2k mentioned he will cut the patch release with the latest Go version; I guess the k8s dependencies should be updated alongside. WDYT?
I don't know whether this is strictly required, while it's certainly a reasonable thing to do. This must be decided by the maintainers of https://github.com/gardener/dependency-watchdog/.
```diff
@@ -15,6 +15,7 @@
 package kubeapiserver

 import (
+	worker "github.com/gardener/gardener/extensions/pkg/controller/worker/genericactuator"
```
Why are we importing a pkg from the extension library just for a constant? I guess this is also why `make verify` fails (import-boss does not allow importing the extension library from `./pkg`).
I couldn't find any constant for this in g/g's `v1beta1constants`, so I used the one from the extension, as I didn't want to introduce a new one if it goes untracked. However, @rfranzke suggested in #5497 (comment) to add it to `v1beta1constants`, so this is fixed with the commit.
With GEP-01 (extensibility), cloud-provider-specific details are extracted to extensions. gardenlet does not need to know anything about MCM and cannot make any assumption that MCM is used. Actually, from gardenlet's point of view there is only the Worker resource, and that is the contract. The fact that provider extensions choose to deploy MCM as part of the Worker reconciliation is not a thing that gardenlet has to assume. IMO this PR is violating GEP-01, as it is making gardenlet configure dependency-watchdog assuming that MCM is used. In theory, a provider extension can implement the contract without using MCM.
I agree with your concern, @ialidzhikov.
Do you suggest an alternative approach here, or are you fine with adding the MCM deployment name to the constants?
Looks like it is hard to find an alternative with the new `scaleRefDependsOn` approach. Generally, having the old config in mind, I was thinking of a well-known label configured in dependency-watchdog-probe: it would scale down all Deployments that match the well-known label. This would allow extensions using MCM to add the label to the MCM Deployment and in this way "request" MCM to be scaled down.
Do we actually need `scaleRefDependsOn`? Can't we simply scale down all components when the probe fails?
Assuming that the `scaleRefDependsOn` handling is needed, I am "fine" with the current approach because I cannot think of a good alternative.
We can think about it. Currently I didn't want to introduce new semantics for identifying scale resources, as the implementation already requires us to provide MCM as part of `scaleRef`, as mentioned below:

```yaml
scaleRef:
  apiVersion: apps/v1
  kind: Deployment
  name: machine-controller-manager
```

So `scaleRefDependsOn` is not the issue here; if we want to do what you suggest, we would also need to change the design of `scaleRef` itself to have a new approach for selecting the deployment.
> Do we actually need scaleRefDependsOn? Can't we simply scale down all components when the probe fails?

The logic currently works seamlessly for both scale-up and scale-down. As you mentioned, scaling down is done all at once. But we need to consider the semantics of scale-up here. This is where we have a problem today: we scale up everything at once, which makes it detrimental to even introduce MCM alongside KCM. As soon as MCM comes up, it will mark the nodes as `Unknown` before giving KCM any chance to update the node status, and will start removing them. To avoid this we wish to delay it using `scaleUpDelaySeconds`.

However, even then we run the risk of KCM not being available; just starting MCM with some delay, without checking if KCM is up, would lead us to the same issue. So we introduced the `scaleRefDependsOn` semantics to avoid running MCM on a state of the system which has not yet been updated by KCM.

To not reinvent the wheel, `scaleRefDependsOn` is just an array of the same type as `scaleRef`.

We will explore if there is a better way to do it without breaking the extension contract for Gardener.
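The dependency-gated scale-up described above can be illustrated with a small, self-contained Go sketch (all names are hypothetical; this is a simplified stand-in, not DWD's actual code): a target is only scaled up once everything it depends on is already up.

```go
package main

import "fmt"

// scaleTarget is a hypothetical stand-in for a scaleRef entry; dependsOn
// mirrors the scaleRefDependsOn idea: names that must be up before this
// target may be scaled up.
type scaleTarget struct {
	name      string
	dependsOn []string
}

// scaleUpInOrder repeatedly picks targets whose dependencies are already
// up and returns the resulting scale-up order.
func scaleUpInOrder(targets []scaleTarget) []string {
	up := map[string]bool{}
	var order []string
	for len(order) < len(targets) {
		progressed := false
		for _, t := range targets {
			if up[t.name] {
				continue
			}
			ready := true
			for _, dep := range t.dependsOn {
				if !up[dep] {
					ready = false
					break
				}
			}
			if ready {
				up[t.name] = true
				order = append(order, t.name)
				progressed = true
			}
		}
		if !progressed {
			break // dependency cycle or missing target: give up
		}
	}
	return order
}

func main() {
	order := scaleUpInOrder([]scaleTarget{
		{name: "cluster-autoscaler", dependsOn: []string{"machine-controller-manager"}},
		{name: "machine-controller-manager", dependsOn: []string{"kube-controller-manager"}},
		{name: "kube-controller-manager"},
	})
	fmt.Println(order) // prints [kube-controller-manager machine-controller-manager cluster-autoscaler]
}
```

Run against the three components from this PR it yields KCM, then MCM, then CA, matching the intended ordering; the real DWD additionally applies `scaleUpDelaySeconds` between the steps.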
I again suggest to be pragmatic here and rather focus on getting this change in so that we can move on to the other (blocking) issues like gardener/dependency-watchdog#36. IMO it's not problematic to specify MCM here, as explained above.
> We will explore if there is a better way to do it without breaking the extension contract for Gardener.

However, can you create an issue on the dependency-watchdog side to make sure that we don't forget about this and that this item is considered/worked on/brainstormed on when there is capacity? Thanks in advance!
@ialidzhikov As suggested, created an issue on DWD to track this.
```go
ScaleRef: autoscalingv1.CrossVersionObjectReference{
	APIVersion: appsv1.SchemeGroupVersion.String(),
	Kind:       "Deployment",
	Name:       v1beta1constants.DeploymentNameClusterAutoscaler,
```
What happens if the Shoot does not enable autoscaling and the cluster-autoscaler deployment is therefore not present?

```
E0304 07:15:46.904892 1 prober.go:405] Scaling up dependents of shoot-kube-apiserver/shoot--foo--bar: apps/v1.Deployment/cluster-autoscaler: replicas=1: failed
E0304 07:15:46.906923 1 prober.go:452] Scaling up dependents of shoot-kube-apiserver/shoot--foo--bar: apps/v1.Deployment/cluster-autoscaler: error getting deployments.apps: deployments/scale.apps "cluster-autoscaler" not found
E0304 07:15:46.906938 1 prober.go:500] Scaling up dependents of shoot-kube-apiserver/shoot--foo--bar: apps/v1.Deployment/cluster-autoscaler: Could not get target reference: deployments/scale.apps "cluster-autoscaler" not found
E0304 07:15:46.906943 1 prober.go:501] Scaling up dependents of shoot-kube-apiserver/shoot--foo--bar: apps/v1.Deployment/cluster-autoscaler: replicas=1: failed
```

One drawback I see is that in this case the logs are "polluted" with such error logs. And this is only one Shoot; imagine 50 Shoots on a Seed that have cluster-autoscaler disabled. dependency-watchdog should rather know that this component is optional and log at info level something like "the Deployment is not present, hence skipping it".
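The requested skip behaviour can be sketched in a few lines of self-contained Go (a hypothetical, simplified stand-in for DWD's scaler, not its real code): absent optional targets produce an info-style skip message instead of an error on every probe run.

```go
package main

import "fmt"

// existing simulates which Deployments are actually present in the shoot
// namespace; cluster-autoscaler is absent because shoot autoscaling is off.
var existing = map[string]bool{
	"kube-controller-manager":    true,
	"machine-controller-manager": true,
}

// scaleDownAll emits an info-style skip message for absent optional targets
// instead of an error on every probe run.
func scaleDownAll(targets []string) []string {
	var msgs []string
	for _, name := range targets {
		if !existing[name] {
			msgs = append(msgs, fmt.Sprintf("I: %s not present, skipping", name))
			continue
		}
		msgs = append(msgs, fmt.Sprintf("I: scaling down %s", name))
	}
	return msgs
}

func main() {
	targets := []string{"kube-controller-manager", "machine-controller-manager", "cluster-autoscaler"}
	for _, m := range scaleDownAll(targets) {
		fmt.Println(m)
	}
}
```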
If this does not cause issues for dependency-watchdog's functioning, then I am also fine with creating an issue on the dependency-watchdog side about the verbose error logging and hoping that this can be fixed in an upcoming version of the component.
A totally valid concern. I've filed an issue with the DWD repo, and also filed a PR for the fix; it also vendors Go 1.17 to be part of the patch release v0.7.1.
The new logs when the cluster-autoscaler is available will be:

```
I0307 15:11:17.440689 1 prober.go:368] shoot-kube-apiserver/shoot--<project>--<shootname>/external: probe result: &scaler.probeResult{lastError:(*url.Error)(0xc000523200), resultRun:4}
I0307 15:11:17.440816 1 prober.go:414] Scaling down dependents of shoot-kube-apiserver/shoot--<project>--<shootname>: apps/v1.Deployment/kube-controller-manager: skipped because desired=0 and current=0
I0307 15:11:17.441299 1 prober.go:414] Scaling down dependents of shoot-kube-apiserver/shoot--<project>--<shootname>: apps/v1.Deployment/machine-controller-manager: skipped because desired=0 and current=0
I0307 15:11:17.441319 1 prober.go:414] Scaling down dependents of shoot-kube-apiserver/shoot--<project>--<shootname>: apps/v1.Deployment/cluster-autoscaler: skipped because desired=0 and current=0
I0307 15:11:37.582672 1 reflector.go:268] github.com/gardener/gardener/pkg/client/extensions/informers/externalversions/factory.go:117: forcing resync
```

When the `cluster-autoscaler` deployment is not present, they will be:

```
I0307 15:11:37.582763 1 scaler.go:67] Update event on cluster: shoot--<project>--<shootname>
I0307 15:11:38.692669 1 reflector.go:268] k8s.io/client-go/informers/factory.go:135: forcing resync
I0307 15:11:51.013434 1 prober.go:353] shoot-kube-apiserver/shoot--<project>--<shootname>/internal: probe succeeded
I0307 15:11:51.013449 1 prober.go:368] shoot-kube-apiserver/shoot--<project>--<shootname>/internal: probe result: &scaler.probeResult{lastError:error(nil), resultRun:4}
I0307 15:11:51.024917 1 prober.go:356] shoot-kube-apiserver/shoot--<project>--<shootname>/external: probe failed with error: Get "https://api.<shootname>.<cluster-address>.com/version?timeout=10s": dial tcp: lookup api.<shootname>.<cluster-address>.com on 100.64.0.10:53: no such host. Will retry...
I0307 15:11:51.024987 1 prober.go:368] shoot-kube-apiserver/shoot--<project>--<shootname>/external: probe result: &scaler.probeResult{lastError:(*url.Error)(0xc0005a1230), resultRun:4}
I0307 15:11:51.025030 1 prober.go:414] Scaling down dependents of shoot-kube-apiserver/shoot--<project>--<shootname>: apps/v1.Deployment/kube-controller-manager: skipped because desired=0 and current=0
I0307 15:11:51.025280 1 prober.go:414] Scaling down dependents of shoot-kube-apiserver/shoot--<project>--<shootname>: apps/v1.Deployment/machine-controller-manager: skipped because desired=0 and current=0
E0307 15:11:51.025351 1 prober.go:405] Scaling down dependents of shoot-kube-apiserver/shoot--<project>--<shootname>: apps/v1.Deployment/cluster-autoscaler: Skipped as target reference: deployment.apps "cluster-autoscaler" not found
I0307 15:12:07.583315 1 reflector.go:268] github.com/gardener/gardener/pkg/client/extensions/informers/externalversions/factory.go:117: forcing resync
I0307 15:12:07.583545 1 scaler.go:67] Update event on cluster: shoot--<project>--<shootname>
```

Once it's merged I can also cut a patch release and vendor it here, or vendor it later if it takes a lot more time to merge.
Thanks for the follow-up. Should we wait for dependency-watchdog@v0.7.1 as part of this PR or should we proceed with dependency-watchdog@v0.7.0?
I'll check with @shreyas-s-rao if we can merge today and release a patch. In that case we can go with v0.7.1.
If we cannot release it today, then you can go ahead, and we'll vendor it along with the bug fix for the secret rotation issue, which will require a patch release anyway.
@ialidzhikov I checked; however, we won't be able to cut the release today, as Shreyas won't be able to complete the review due to other things at hand. So either we wait till tomorrow, as we are confident of releasing it then, or we vendor it with the next set of changes for DWD.
Force-pushed from baf409c to 94086ac
Force-pushed from 94086ac to df84f18
/lgtm
/lgtm
/hold
for a while because of #5497 (comment)
/unhold
@kris94 Can you please add the release milestone and merge this one?
/reviewed/ok-to-test
…ardener#5497)

* Vendored dependency-watchdog 0.7.0
* Adapted dependency-watchdog component to work with DWD v0.7.0
* Updated charts for DWD and adapted for MCM deployment under v1beta1 constants
* Adapted RBAC for dependency-watchdog-endpoint as well
How to categorize this PR?
/area disaster-recovery
/area control-plane
/kind enhancement
/squash
What this PR does / why we need it:
This PR integrates the changes brought in with release v0.7.0 of `dependency-watchdog`. The release brings in two prominent changes.

The changes introduced with the DWD PR#39 are:
As a result of the above two changes, the new config file is adapted to bring down `kube-controller-manager` along with `machine-controller-manager` and `cluster-autoscaler`.

Also, while scaling up, we give a 120-second delay to let the `kubelet` update the node status before scaling up `kube-controller-manager`. We give another 60-second delay to `machine-controller-manager` after `kube-controller-manager` is up, to ensure that the node status is updated correctly by `KCM`. `cluster-autoscaler` shall come up immediately once `machine-controller-manager` is up.

This shall ensure that we don't accidentally let MCM and Cluster Autoscaler replace machines, and that they wait for api-server availability before taking any action.

These new changes are reflected in the new config file as depicted below and are accordingly adapted in the component files for dependency-watchdog.
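A sketch of what the adapted probe config could look like. This is illustrative only: `scaleRef`, `scaleRefDependsOn`, and `scaleUpDelaySeconds` come from this discussion, while the surrounding structure (e.g. the `dependantScales` key) is an assumption; the authoritative schema is defined by dependency-watchdog v0.7.0.

```yaml
# Illustrative sketch, not the actual chart content.
dependantScales:
- scaleRef:                      # KCM first, after a 120s delay so the
    apiVersion: apps/v1          # kubelet can refresh node status
    kind: Deployment
    name: kube-controller-manager
  scaleUpDelaySeconds: 120
- scaleRef:                      # MCM only after KCM is up, plus 60s delay
    apiVersion: apps/v1
    kind: Deployment
    name: machine-controller-manager
  scaleUpDelaySeconds: 60
  scaleRefDependsOn:
  - apiVersion: apps/v1
    kind: Deployment
    name: kube-controller-manager
- scaleRef:                      # CA immediately once MCM is up
    apiVersion: apps/v1
    kind: Deployment
    name: cluster-autoscaler
  scaleRefDependsOn:
  - apiVersion: apps/v1
    kind: Deployment
    name: machine-controller-manager
```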
The second set of changes are the ones introduced with the switch of leader election to endpoint leases (PR#37), introduced by @ary1992.
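Leases-based leader election typically needs additional RBAC on `coordination.k8s.io` leases; a hypothetical sketch of the kind of rule involved (the exact rules are defined in PR#37, not here):

```yaml
# Hypothetical rule for leases-based leader election; see PR#37 for the
# authoritative change.
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - get
  - watch
  - update
```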
These changes require the existing cluster role for `dependency-watchdog-probe` to be enhanced accordingly.

Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
* The release does not bump the `Go` version and is still working with `1.13`. It was an oversight, as there has been no release for `DWD` for some time, but we wish to include it in the imminent patch release because we still have some pending tasks to follow up:
  * Update `README.md` with the changes introduced.
  * Fix for `DWD` in reading the `kubeconfigs` when loading the rotated secrets #PR36

In lieu of these imminent changes we can upgrade the `Go` version to 1.17.7 with the next patch release of `DWD` to be integrated with `g/g`.

Release note: