-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a CPU utilization resource monitor for overload manager #34713
Conversation
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this!
/wait
api/envoy/extensions/resource_monitors/cpu_utilization/v3/cpu_utilization.proto
Outdated
Show resolved
Hide resolved
source/extensions/resource_monitors/cpu_utilization/linux_cpu_stats_reader.cc
Outdated
Show resolved
Hide resolved
source/extensions/resource_monitors/cpu_utilization/linux_cpu_stats_reader.cc
Outdated
Show resolved
Hide resolved
source/extensions/resource_monitors/cpu_utilization/cpu_utilization_monitor.h
Outdated
Show resolved
Hide resolved
source/extensions/resource_monitors/cpu_utilization/cpu_utilization_monitor.cc
Outdated
Show resolved
Hide resolved
source/extensions/resource_monitors/cpu_utilization/cpu_utilization_monitor.cc
Outdated
Show resolved
Hide resolved
source/extensions/resource_monitors/cpu_utilization/cpu_utilization_monitor.cc
Outdated
Show resolved
Hide resolved
test/extensions/resource_monitors/cpu_utilization/linux_cpu_stats_reader_test.cc
Outdated
Show resolved
Hide resolved
test/extensions/resource_monitors/cpu_utilization/linux_cpu_stats_reader_test.cc
Outdated
Show resolved
Hide resolved
Signed-off-by: Can Cecen <ccecen@netflix.com>
Some other ideas: We use cgroups cpu controller to limit CPU usage in a container environment. So maybe some cgroup metrics, eg. cpu.stat is a valuable resource monitor for overload manager. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, lgtm except the two other pending comments, CI and we need release notes in changelogs/current.yaml
.
Some other ideas: We use cgroups cpu controller to limit CPU usage in a container environment. So maybe some cgroup metrics, eg. cpu.stat is a valuable resource monitor for overload manager.
/wait |
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
hi @cancecen - seems CI is failing /wait |
/lgtm api |
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
Signed-off-by: Can Cecen <ccecen@netflix.com>
/docs |
Docs for this Pull Request will be rendered here: https://storage.googleapis.com/envoy-pr/34713/docs/index.html The docs are (re-)rendered each time the CI |
Thanks @phlax. Looks good to me, let me know what you think: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! thanks
lgtm from docs pov - thanks for iterating
@KBaichoo would you mind doing another review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, lgtm. Thank you
std::string buffer(5, '\0'); | ||
cpu_stats_file.read(buffer.data(), 5); | ||
const std::string target = "cpu "; | ||
if (!cpu_stats_file || !std::equal(buffer.begin(), buffer.end(), target.begin(), target.end())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just compare buffer != target
Somewhat related question: I see that you use kernel reported CPU usage here, but we use self-reported memory usage in |
Signed-off-by: Can Cecen <ccecen@netflix.com>
Yes. We use the kernel reported one because we really want to get the view of the whole system - i.e. there can be workloads Envoy is proxying requests running in the system that are actually compute intensive. This way we can make the decision based on that. |
Is there a reason why you are not using |
That can be a follow up to this PR. Today in my company we have 3 different compute platforms and the majority of the workloads are in VMs. That's why I structured the code to have the possibility of separate cpu stats reader per platform. |
@kyessenov To add a little bit more here, we do have a k8s-like container platform within my company and I use |
/retest |
* upstream/main: (21 commits) Add a CPU utilization resource monitor for overload manager (envoyproxy#34713) jwks: Add UA string to headers (envoyproxy#35977) exceptions: cleaning up macros (envoyproxy#35694) coverage: ratcheting (envoyproxy#36058) runtime: load rtds bool correctly as true/false instead of 1/0 (envoyproxy#36044) Typo in documentation of http original_src filter (envoyproxy#36060) docs: updating meeting info (envoyproxy#36052) quic: removes more references to spdy::Http2HeaderBlock. (envoyproxy#36057) json: add null support to the streamer (envoyproxy#36051) json: make the streamer a template class (envoyproxy#36001) docs: Add `apt.envoyproxy.io` install information (envoyproxy#36050) ext_proc: elide redundant copy in ext_proc filter factory callback (envoyproxy#36015) build(deps): bump yarl from 1.11.0 to 1.11.1 in /tools/base (envoyproxy#36049) build(deps): bump multidict from 6.0.5 to 6.1.0 in /tools/base (envoyproxy#36048) quic: enable certificate compression/decompression (envoyproxy#35999) Geoip fix asan failure (envoyproxy#36043) mobile: Fix missing logging output in Swift integration tests (envoyproxy#36040) http: minor code clean up to the http filter manager (envoyproxy#36027) ci/example: Dont build/test the filter example in Envoy CI (envoyproxy#36038) ci/codeql: Fix build setup (envoyproxy#36021) ... Signed-off-by: Qiu Yu <qiuyu@apple.com>
Commit Message: Add a CPU utilization resource monitor for overload manager
Additional Description: Adds a new resource monitor to be used in CPU bound workloads to shed load. i.e. this can be configured to reject requests once CPU Utilization reaches a certain brownout point.
In my company, we experience user driven retry storms and/or unexpected flash crowds that are mitigated via auto scaling of our compute resources. However, auto scaling takes time and we still want to keep our fleet reasonably healthy during auto scaling and having a way to cheaply reject requests via overload manager has been very helpful. As a platform owner, it is easier to configure limits for CPU bound workloads with this resource monitor than other RPS/Latency/Concurrency based limiters.
Risk Level: Low
Testing: Unit tests
Docs Changes:
Release Notes:
Platform Specific Features: Today this is only implemented for Linux