
Add a CPU utilization resource monitor for overload manager #34713

Merged
merged 22 commits into envoyproxy:main on Sep 11, 2024

Conversation

cancecen
Contributor

Commit Message: Add a CPU utilization resource monitor for overload manager
Additional Description: Adds a new resource monitor for shedding load in CPU-bound workloads, i.e. it can be configured to reject requests once CPU utilization reaches a certain brownout point.

In my company, we experience user-driven retry storms and/or unexpected flash crowds that are mitigated by autoscaling our compute resources. However, autoscaling takes time, and we still want to keep our fleet reasonably healthy while it kicks in; having a way to cheaply reject requests via the overload manager has been very helpful. As a platform owner, it is easier to configure limits for CPU-bound workloads with this resource monitor than with other RPS-, latency-, or concurrency-based limiters.

Risk Level: Low
Testing: Unit tests
Docs Changes:
Release Notes:
Platform Specific Features: Currently implemented only for Linux
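
For illustration, a minimal sketch of how such a monitor might be wired into the overload manager; the extension name, type URL, and threshold below are assumptions modeled on Envoy's existing resource monitors, so check the docs rendered for this PR for the real names:

```yaml
# Hypothetical configuration sketch; names follow Envoy conventions and
# should be verified against the docs rendered for this PR.
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: envoy.resource_monitors.cpu_utilization
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.cpu_utilization.v3.CpuUtilizationConfig
  actions:
    - name: envoy.overload_actions.stop_accepting_requests
      triggers:
        - name: envoy.resource_monitors.cpu_utilization
          threshold:
            value: 0.95  # the "brownout point": start rejecting above 95% CPU
```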

cancecen and others added 3 commits June 6, 2024 16:56

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

Contributor

@KBaichoo left a comment

Thank you for working on this!

/wait

@labilezhu
Contributor

Some other ideas: we use the cgroups cpu controller to limit CPU usage in container environments, so some cgroup metrics, e.g. cpu.stat, could make a valuable resource monitor source for the overload manager.
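
For illustration, a rough sketch of what a cgroup v2 based reader could look like; the path, the field handling, and the use of the host CPU count (rather than a cpu.max quota) are all assumptions:

```cpp
// Hypothetical cgroup v2 reader (not part of this PR): cpu.stat exposes a
// cumulative usage_usec counter, so utilization is the delta in CPU time
// divided by elapsed wall time times the number of usable CPUs.
#include <chrono>
#include <cstdint>
#include <fstream>
#include <string>
#include <thread>

struct CgroupCpuSample {
  uint64_t usage_usec = 0;  // cumulative CPU time consumed by the cgroup
  std::chrono::steady_clock::time_point taken_at;
};

CgroupCpuSample readCgroupCpuStat() {
  CgroupCpuSample sample;
  sample.taken_at = std::chrono::steady_clock::now();
  std::ifstream stat_file("/sys/fs/cgroup/cpu.stat");  // assumes cgroup v2 root
  std::string key;
  uint64_t value;
  while (stat_file >> key >> value) {  // cpu.stat is one "key value" per line
    if (key == "usage_usec") {
      sample.usage_usec = value;
      break;
    }
  }
  return sample;
}

double utilizationBetween(const CgroupCpuSample& prev, const CgroupCpuSample& cur) {
  const auto wall_usec = std::chrono::duration_cast<std::chrono::microseconds>(
                             cur.taken_at - prev.taken_at)
                             .count();
  const double cpus = std::thread::hardware_concurrency();
  if (wall_usec <= 0 || cpus == 0) {
    return 0.0;
  }
  // A container with a cpu.max quota would divide by the quota instead.
  return static_cast<double>(cur.usage_usec - prev.usage_usec) / (wall_usec * cpus);
}
```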

Contributor

@KBaichoo left a comment

Thanks, lgtm except for the two other pending comments and CI; we also need release notes in changelogs/current.yaml.
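
For reference, release notes live in changelogs/current.yaml; an entry for this change might look roughly like this (wording illustrative, not the PR's final text):

```yaml
# Illustrative changelogs/current.yaml entry; the PR's final wording may differ.
new_features:
- area: overload
  change: |
    Added a CPU utilization resource monitor that the overload manager can use
    to shed load (e.g. reject requests) once system CPU utilization crosses a
    configured threshold. Currently implemented for Linux only.
```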

@KBaichoo
Contributor

KBaichoo commented Jun 26, 2024

> Some other ideas: we use the cgroups cpu controller to limit CPU usage in container environments, so some cgroup metrics, e.g. cpu.stat, could make a valuable resource monitor source for the overload manager.

@labilezhu agreed, another source for the CPU monitor would be a good follow-up.

/wait


This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions bot added the stale label Jul 26, 2024
@KBaichoo removed the stale label Jul 30, 2024
@github-actions bot added the stale label Aug 29, 2024
@github-actions bot removed the stale label Aug 30, 2024
@phlax
Member

phlax commented Sep 3, 2024

hi @cancecen - seems CI is failing

/wait

@markdroth
Contributor

/lgtm api

@repokitteh-read-only bot removed the api label Sep 5, 2024
cancecen and others added 10 commits September 6, 2024 22:44
@phlax
Member

phlax commented Sep 10, 2024

/docs


Docs for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-pr/34713/docs/index.html

The docs are (re-)rendered each time the CI envoy-presubmit (precheck docs) job completes.



phlax
Member

phlax previously approved these changes Sep 10, 2024

@phlax left a comment

nice! thanks

lgtm from docs pov - thanks for iterating

@phlax
Member

phlax commented Sep 10, 2024

@KBaichoo would you mind doing another review?

Contributor

@KBaichoo left a comment

Otherwise, lgtm. Thank you

```cpp
std::string buffer(4, '\0');
cpu_stats_file.read(buffer.data(), 4);
const std::string target = "cpu ";
if (!cpu_stats_file || !std::equal(buffer.begin(), buffer.end(), target.begin(), target.end())) {
```

you can just compare buffer != target
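
Applied to the snippet above, the suggested simplification might look like this (a sketch; the early-return body is hypothetical):

```cpp
// Sketch of the reviewer's suggestion: with buffer sized to match target
// (4 bytes for "cpu "), a plain operator!= replaces the iterator-based
// std::equal call.
std::string buffer(4, '\0');
cpu_stats_file.read(buffer.data(), 4);
const std::string target = "cpu ";
if (!cpu_stats_file || buffer != target) {
  return {};  // hypothetical: report that /proc/stat could not be parsed
}
```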

@kyessenov
Contributor

Somewhat related question: I see that you use kernel-reported CPU usage here, but we use self-reported memory usage in fixed_heap. If we were consistent, we'd probably use cgroup2 /sys/fs/cgroup/memory.current for the memory monitor. Is there some rationale for choosing the kernel here but a user-space monitor there?

@cancecen
Contributor Author

> Somewhat related question: I see that you use kernel-reported CPU usage here, but we use self-reported memory usage in fixed_heap. If we were consistent, we'd probably use cgroup2 /sys/fs/cgroup/memory.current for the memory monitor. Is there some rationale for choosing the kernel here but a user-space monitor there?

Yes. We use the kernel-reported one because we really want the view of the whole system, i.e. there can be compute-intensive workloads running on the same system that Envoy is proxying requests for. This way we can make the shedding decision based on total host utilization.
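
For reference, a sketch of the whole-system view described here: the first line of /proc/stat aggregates jiffies across every CPU and every process on the host, so a monitor built on its deltas sees the co-located workloads too. The field handling below is simplified and assumed, not the PR's exact code:

```cpp
// Hypothetical sketch: whole-system CPU utilization from /proc/stat deltas.
// The aggregate "cpu" line counts jiffies for all CPUs and all processes,
// so the ratio reflects everything running on the host, not just Envoy.
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

struct CpuTimes {
  uint64_t work = 0;   // user + nice + system jiffies
  uint64_t total = 0;  // all fields, including idle and iowait
};

CpuTimes readProcStat() {
  std::ifstream stat_file("/proc/stat");
  std::string line;
  std::getline(stat_file, line);  // the aggregate "cpu" line comes first
  std::istringstream fields(line);
  std::string label;
  fields >> label;  // consume the "cpu" label
  CpuTimes times;
  uint64_t value;
  for (int i = 0; fields >> value; ++i) {
    if (i < 3) {
      times.work += value;  // user, nice, system
    }
    times.total += value;
  }
  return times;
}

// Utilization over an interval is the share of non-idle time in that window.
double utilization(const CpuTimes& prev, const CpuTimes& cur) {
  const uint64_t total_delta = cur.total - prev.total;
  return total_delta == 0 ? 0.0
                          : static_cast<double>(cur.work - prev.work) / total_delta;
}
```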

@kyessenov
Contributor


Is there a reason why you are not using cpu.stat from cgroup2? That would work better on k8s, for example, where Envoy runs inside a container.

@cancecen
Contributor Author


> Is there a reason why you are not using cpu.stat from cgroup2? That would work better on k8s, for example, where Envoy runs inside a container.

That can be a follow-up to this PR. Today my company has three different compute platforms, and the majority of the workloads are in VMs. That's why I structured the code to allow a separate CPU stats reader per platform.
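
A sketch of the per-platform structure described here; class and member names are illustrative, not necessarily the PR's actual ones:

```cpp
// Illustrative shape of a per-platform CPU stats reader split: the monitor
// depends only on the abstract reader, and each compute platform plugs in
// its own snapshot source (/proc/stat on VMs, cgroup cpu.stat in containers).
#include <cstdint>

struct CpuTimes {
  bool is_valid = false;   // false when the underlying stats were unreadable
  uint64_t work_time = 0;  // time spent doing work (non-idle)
  uint64_t total_time = 0; // total time, idle included
};

class CpuStatsReader {
public:
  virtual ~CpuStatsReader() = default;
  virtual CpuTimes getCpuTimes() = 0;
};

class LinuxCpuStatsReader : public CpuStatsReader {
public:
  CpuTimes getCpuTimes() override;  // parses the aggregate line of /proc/stat
};

// A follow-up container-aware reader could implement the same interface:
// class CgroupV2CpuStatsReader : public CpuStatsReader { ... };
```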

@KBaichoo enabled auto-merge (squash) September 11, 2024 18:49
@cancecen
Contributor Author


@kyessenov To add a little more here: we do have a k8s-like container platform within my company, and I use cpu.stat to calculate overall CPU utilization there. I'll upstream that in the next few weeks as I find time. No plans for Windows-based systems as of now, though.

@cancecen
Contributor Author

/retest

@KBaichoo merged commit 4d12162 into envoyproxy:main Sep 11, 2024
38 of 40 checks passed
unicell added a commit to unicell/envoy that referenced this pull request Sep 11, 2024
* upstream/main: (21 commits)
  Add a CPU utilization resource monitor for overload manager (envoyproxy#34713)
  jwks: Add UA string to headers (envoyproxy#35977)
  exceptions: cleaning up macros (envoyproxy#35694)
  coverage: ratcheting (envoyproxy#36058)
  runtime: load rtds bool correctly as true/false instead of 1/0 (envoyproxy#36044)
  Typo in documentation of http original_src filter (envoyproxy#36060)
  docs: updating meeting info (envoyproxy#36052)
  quic: removes more references to spdy::Http2HeaderBlock. (envoyproxy#36057)
  json: add null support to the streamer (envoyproxy#36051)
  json: make the streamer a template class (envoyproxy#36001)
  docs: Add `apt.envoyproxy.io` install information (envoyproxy#36050)
  ext_proc: elide redundant copy in ext_proc filter factory callback (envoyproxy#36015)
  build(deps): bump yarl from 1.11.0 to 1.11.1 in /tools/base (envoyproxy#36049)
  build(deps): bump multidict from 6.0.5 to 6.1.0 in /tools/base (envoyproxy#36048)
  quic: enable certificate compression/decompression (envoyproxy#35999)
  Geoip fix asan failure (envoyproxy#36043)
  mobile: Fix missing logging output in Swift integration tests (envoyproxy#36040)
  http: minor code clean up to the http filter manager (envoyproxy#36027)
  ci/example: Dont build/test the filter example in Envoy CI (envoyproxy#36038)
  ci/codeql: Fix build setup (envoyproxy#36021)
  ...
