Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove CPU limits from gardener components #5627

Merged
merged 1 commit into from
Mar 31, 2022

Conversation

stoyanr
Copy link
Contributor

@stoyanr stoyanr commented Mar 22, 2022

How to categorize this PR?

/kind enhancement

What this PR does / why we need it:

  • Removes CPU limits from all components to prevent CPU throttling due to reaching limits.

The following will be added in separate PRs:

  • Remove or adjust memory limits from all components to match actual usage on landscapes.
  • Fix VerticalPodAutoscaler resources to use RequestsOnly instead of RequestsAndLimits
  • Fix VerticalPodAutoscaler and Hvpa resources to remove MaxAllowed where it's present

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

CPU limits from all gardener components have been removed to prevent CPU throttling due to reaching limits.

@stoyanr stoyanr requested a review from a team as a code owner March 22, 2022 15:26
@stoyanr stoyanr marked this pull request as draft March 22, 2022 15:26
@gardener-robot gardener-robot added the kind/enhancement Enhancement, improvement, extension label Mar 22, 2022
@gardener-robot
Copy link

@stoyanr Label area/controlplane does not exist.

@gardener-robot gardener-robot added needs/review size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 22, 2022
@gardener-robot gardener-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs/second-opinion and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 23, 2022
@stoyanr stoyanr force-pushed the enh/remove-or-adjust-limits branch from 398be0c to c79afde Compare March 23, 2022 14:40
@gardener-robot gardener-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 23, 2022
@gardener-robot
Copy link

@stoyanr Label area/controlplane does not exist.

@stoyanr stoyanr marked this pull request as ready for review March 23, 2022 14:43
@stoyanr
Copy link
Contributor Author

stoyanr commented Mar 23, 2022

/invite @danielfoehrKn @voelzmo

@stoyanr stoyanr changed the title Remove CPU limits and adjust memory limits Remove CPU limits from gardener components Mar 23, 2022
@gardener-robot
Copy link

@stoyanr Label area/controlplane does not exist.

@stoyanr
Copy link
Contributor Author

stoyanr commented Mar 23, 2022

I consider this PR ready for review now.

rfranzke
rfranzke previously approved these changes Mar 24, 2022
Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@vlerenc
Copy link
Member

vlerenc commented Mar 29, 2022

LOL, thanks @rfranzke.

(1) #255 was a band-aid from mid 2018 for the poor VPA v1 for the actual ticket...

(2) #16 that we couldn't implement with the former VPA v1 and so we did this, because we hit...

(3) #79, where KCM stopped working under load

cc @voelzmo we might want to rid is of all that scalingClass workarounds now with VPA v2, if we can make it work at scale and as we want it to work.

So funny/shocking how long this topic is now with us. I see Gaurav made the changes. He was the first of the 3 colleagues on the auto-scaling topic - not in parallel - in sequence over 4 years. :-)

Time to give it a new shot and improve or ideally get rid of all these hacks.

@stoyanr
Copy link
Contributor Author

stoyanr commented Mar 30, 2022

Time to give it a new shot and improve or ideally get rid of all these hacks.

@vlerenc I will take a look, and perhaps improve it in a separate PR.

@timebertt timebertt added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed reviewed/do-not-merge labels Mar 30, 2022
@gardener-prow gardener-prow bot added the cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. label Mar 30, 2022
@vlerenc
Copy link
Member

vlerenc commented Mar 30, 2022

Thanks @stoyanr. I am just a bit cautious here as well.

The point is that it may still be necessary to have "decent" requests, "oriented" at the number/size of the nodes, so that we start in the correct ballpark and avoid lengthy up-scaling iterations with also the current VPAv2 (unless tuned). I say that because I remember many years before that e.g. waking up clusters (after hibernation) took very long to scale up and become operational again. Scaling classes "catapulted" control planes into the right ballpark. As long as VPAv2 is making poor recommendations or isn't bumping up faster, most of that code would probably still be required. We can only really rid ourselves of it once VPAv2 overcomes it current problems. At least, that's my gut feeling.

@stoyanr
Copy link
Contributor Author

stoyanr commented Mar 30, 2022

As long as VPAv2 is making poor recommendations or isn't bumping up faster, most of that code would probably still be required. We can only really rid ourselves of it once VPAv2 overcomes it current problems.

OK, in this case I will postpone this for later.

@stoyanr
Copy link
Contributor Author

stoyanr commented Mar 30, 2022

@vlerenc I examined the scalingClass implementation and played with it a bit, it seems that when HVPA is enabled (it's a feature gate in gardenlet that I believe is enabled by default), the scaling class is only really taken into account the first time the deployment is created. If the deployment already exists, the resources are copied from the existing deployment. This means that if the cluster started with a low initial number of nodes, increasing the number of nodes later will have no effect on the resources in the deployment. Hibernating / waking up will certainly have no effect either.

I am rather skeptical of this complex implementation (scaling class + HVPA), but as agreed, let's take another look and possibly change it once we are more confident of VPA recommendations.

@vlerenc
Copy link
Member

vlerenc commented Mar 30, 2022

@stoyanr My memory is from many years back with end users complaining in Slack and whatever you have found out today counts (not my memory), because we may have influenced the behaviour directly or indirectly many times over since 2018.

@gardener-prow gardener-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 30, 2022
@stoyanr stoyanr force-pushed the enh/remove-or-adjust-limits branch from 2dd8e90 to 372dc19 Compare March 30, 2022 14:50
@gardener-prow gardener-prow bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 30, 2022
@stoyanr
Copy link
Contributor Author

stoyanr commented Mar 30, 2022

@vlerenc @rfranzke The new etcd-druid release has been merged and I removed the few TODOs in the PR (now possible with the new etcd-druid version). I consider the PR ready to be merged. Could I get an approval or do you have any other open points?

Copy link
Member

@rfranzke rfranzke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
/hold

@vlerenc any objections? If not please /hold cancel so that the PR gets auto-merged by Prow.

@gardener-prow gardener-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 30, 2022
@gardener-prow gardener-prow bot added the lgtm Indicates that a PR is ready to be merged. label Mar 30, 2022
@gardener-prow
Copy link
Contributor

gardener-prow bot commented Mar 30, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rfranzke

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vlerenc
Copy link
Member

vlerenc commented Mar 31, 2022

/hold cancel
@stoyanr Thanks for the PR. 👍 Let's roll it our carefully now, watching for possible issues, increased node usage, etc. please.
@rfranzke I have not hold this one (in fact, I lgtm'd it, because that's what we want - I can only check from the "intention" level and not every line anyway/am no expert), but you wanted to double-check because of Prow's automation?

@gardener-prow gardener-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2022
@gardener-prow gardener-prow bot merged commit 710ca09 into gardener:master Mar 31, 2022
@rfranzke
Copy link
Member

I simply wasn't sure whether you are fine merging this PR now because there were quite some follow-up questions after your /lgtm and I couldn't follow everything in detail :)

krgostev pushed a commit to krgostev/gardener that referenced this pull request Apr 21, 2022
krgostev pushed a commit to krgostev/gardener that referenced this pull request Jul 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Indicates the PR's author has signed the cla-assistant.io CLA. kind/enhancement Enhancement, improvement, extension lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet