Integrate enhanced meltdown handling for dependency-watchdog probe #5497
Why are we importing a package from the extension library just to use a constant? I guess this is also why `make verify` fails (import-boss does not allow importing the extension library from `./pkg`).
I couldn't find any constant for this in g/g's `v1beta1constants`, so I used the one from the extension, as I didn't want to introduce a new one if it goes untracked. However, as suggested by @rfranzke in #5497 (comment), it is now added to `v1beta1constants`. So this is fixed with the commit.
With GEP-01 (extensibility), cloud-provider-specific details are extracted to extensions. gardenlet does not need to know anything about MCM and cannot make any assumption that MCM is used. Actually, from gardenlet's point of view there is only the Worker resource, and that is the contract. The fact that provider extensions choose to deploy MCM as part of the Worker reconciliation is not something gardenlet gets to assume. IMO this PR violates GEP-01, as it makes gardenlet configure dependency-watchdog assuming that MCM is used. In theory, a provider extension can implement the contract without using MCM.
@ialidzhikov See #5497 (comment)
I agree with your concern, @ialidzhikov.
Do you suggest an alternative approach here, or are you fine with adding the MCM deployment name to the constants?
Looks like it is hard to come up with something better than the new `scaleRefDependsOn` approach. Generally, having the old config in mind, I was thinking of a well-known label configured in dependency-watchdog-probe: it scales down all Deployments that match the well-known label. This would allow extensions using MCM to add the well-known label to the MCM Deployment and in this way "request" that MCM be scaled down.

Do we actually need `scaleRefDependsOn`? Can't we simply scale down all components when the probe fails?

Assuming that the `scaleRefDependsOn` handling is needed, I am "fine" with the current approach because I cannot think of a good alternative.
We can think about it. Currently I didn't want to introduce new semantics for identifying scale resources, as the implementation already requires us to provide MCM as part of `scaleRef`, as mentioned below. So `scaleRefDependsOn` is not the issue here; if we want to do what you suggest, we would also need to change the design of `scaleRef` itself to have a new approach for selecting the deployment.

The logic currently works seamlessly for both scale-ups and scale-downs. As you mentioned, scaling down is done all at once. But we need to consider the semantics of scale-up here, and this is where we have a problem today: we scale up everything at once, which makes it detrimental to even introduce MCM alongside KCM. As soon as MCM comes up, it will mark the nodes as `Unknown` before giving KCM any chance to update the node status, and will start removing them. To avoid this we want to delay it using `scaleUpDelaySeconds`. However, even then we run the risk of KCM not being available; just starting MCM with some delay, without checking whether KCM is up, would lead us to the same issue. So we introduced the `scaleRefDependsOn` semantics to avoid running MCM against a state of the system that has not yet been updated by KCM.

To not reinvent the wheel, `scaleRefDependsOn` is just an array of the same type as `scaleRef`. We will explore whether there is a better way to do it without breaking the extension contract for Gardener.
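Putting the pieces above together, the probe configuration might look roughly like the following. This is a sketch reconstructed from the discussion; the exact field names and structure are assumptions, not the actual dependency-watchdog API:

```yaml
# Hypothetical dependency-watchdog-probe config; field names are
# illustrative, based only on the terms used in this thread.
probes:
- name: shoot-kube-apiserver
  dependantScales:
  - scaleRef:                  # what to scale down when the probe fails
      kind: Deployment
      name: machine-controller-manager
    scaleUpDelaySeconds: 120   # delay MCM's scale-up so KCM can refresh node statuses first
    scaleRefDependsOn:         # same type as scaleRef: only scale MCM up once these are up
    - kind: Deployment
      name: kube-controller-manager
```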
I again suggest being pragmatic here and rather focusing on getting this change in, so that we can move on to the other (blocking) issues like gardener/dependency-watchdog#36. IMO it's not problematic to specify MCM here, as explained above.
However, can you create an issue on the dependency-watchdog side to make sure that we don't forget about this, so that this item is considered/worked on/brainstormed when there is capacity? Thanks in advance!
@ialidzhikov As suggested, I created an issue on DWD to track this.
What happens if the Shoot does not enable autoscaling, so that the cluster-autoscaler deployment is not present?
One drawback I see is that in this case the logs are "polluted" with such error logs. And this is only one Shoot; imagine 50 Shoots on a Seed with cluster-autoscaler disabled. dependency-watchdog should rather know that this component is optional and log at info level something like "the Deployment is not present, hence skipping it".
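The behavior asked for here boils down to distinguishing "optional dependent is absent" from a real failure. A minimal sketch, using a sentinel error in place of `apierrors.IsNotFound` from k8s.io/apimachinery (names and structure are illustrative, not dependency-watchdog's actual code):

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for a Kubernetes NotFound API error.
var errNotFound = errors.New("deployment not found")

// logLevelFor decides how to report a failed deployment lookup:
// an optional dependent (like cluster-autoscaler) that is simply
// absent should be logged at info level and skipped, not treated
// as an error.
func logLevelFor(err error, optional bool) string {
	if err == nil {
		return "none"
	}
	if optional && errors.Is(err, errNotFound) {
		return "info" // "the Deployment is not present, hence skipping it"
	}
	return "error"
}

func main() {
	fmt.Println(logLevelFor(errNotFound, true))  // optional and absent: skip with info log
	fmt.Println(logLevelFor(errNotFound, false)) // required and absent: genuine error
}
```

This keeps Seed-wide logs quiet for the many Shoots that legitimately run without a cluster-autoscaler, while still surfacing missing required components.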
If this does not cause issues for dependency-watchdog's operation, then I am also fine with creating an issue on the dependency-watchdog side about the verbose error logging and hoping that it can be fixed in an upcoming version of the component.
A totally valid concern. I've filed an issue with the DWD repo, and also a PR for the fix; the PR additionally vendors Go 1.17 so it can be part of the patch release, v0.7.1.

The new logs when cluster-autoscaler is available, and when the `cluster-autoscaler` deployment is not present, can be seen in that PR (the log snippets are not reproduced here).

Once it's merged I can also cut a patch release and vendor it here, or vendor it later if the merge takes a lot more time.
Thanks for the follow-up. Should we wait for dependency-watchdog@v0.7.1 as part of this PR or should we proceed with dependency-watchdog@v0.7.0?
I'll check with @shreyas-s-rao whether we can merge today and release a patch. In that case we can go with v0.7.1.
If we cannot release it today, then you can go ahead, and we vendor it along with the bug fix for the secret rotation issue, which will require a patch release anyway.
@ialidzhikov I checked; however, we won't be able to cut the release today, as Shreyas won't be able to complete the review today due to other things at hand. So either we wait till tomorrow, as we are confident of releasing it then, or we vendor it with the next set of changes for DWD.