
Add a CPU utilization resource monitor for overload manager #34713

Merged
merged 22 commits into envoyproxy:main on Sep 11, 2024

Conversation

cancecen
Contributor

Commit Message: Add a CPU utilization resource monitor for overload manager
Additional Description: Adds a new resource monitor for shedding load in CPU-bound workloads, i.e. it can be configured to reject requests once CPU utilization reaches a certain brownout point.

In my company, we experience user-driven retry storms and/or unexpected flash crowds that are mitigated by autoscaling our compute resources. However, autoscaling takes time, and we still want to keep our fleet reasonably healthy while it kicks in; having a way to cheaply reject requests via the overload manager has been very helpful. As a platform owner, it is easier to configure limits for CPU-bound workloads with this resource monitor than with other RPS-, latency-, or concurrency-based limiters.

Risk Level: Low
Testing: Unit tests
Docs Changes:
Release Notes:
Platform Specific Features: Currently implemented only for Linux
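
For illustration, a minimal sketch of how such a monitor might be wired into the overload manager; the extension name, type URL, and threshold below are assumptions modeled on Envoy's existing resource monitors, so check the docs rendered for this PR for the real names:

```yaml
# Hypothetical configuration sketch; names follow Envoy conventions and
# should be verified against the docs rendered for this PR.
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: envoy.resource_monitors.cpu_utilization
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.cpu_utilization.v3.CpuUtilizationConfig
  actions:
    - name: envoy.overload_actions.stop_accepting_requests
      triggers:
        - name: envoy.resource_monitors.cpu_utilization
          threshold:
            value: 0.95  # the "brownout point": start rejecting above 95% CPU
```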

cancecen and others added 3 commits June 6, 2024 16:56

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

Contributor

@KBaichoo left a comment

Thank you for working on this!

/wait

@labilezhu
Contributor

Some other ideas: we use the cgroups cpu controller to limit CPU usage in container environments, so some cgroup metrics, e.g. cpu.stat, could make a valuable resource monitor source for the overload manager.
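
For illustration, a rough sketch of what a cgroup v2 based reader could look like; the path, the field handling, and the use of the host CPU count (rather than a cpu.max quota) are all assumptions:

```cpp
// Hypothetical cgroup v2 reader (not part of this PR): cpu.stat exposes a
// cumulative usage_usec counter, so utilization is the delta in CPU time
// divided by elapsed wall time times the number of usable CPUs.
#include <chrono>
#include <cstdint>
#include <fstream>
#include <string>
#include <thread>

struct CgroupCpuSample {
  uint64_t usage_usec = 0;  // cumulative CPU time consumed by the cgroup
  std::chrono::steady_clock::time_point taken_at;
};

CgroupCpuSample readCgroupCpuStat() {
  CgroupCpuSample sample;
  sample.taken_at = std::chrono::steady_clock::now();
  std::ifstream stat_file("/sys/fs/cgroup/cpu.stat");  // assumes cgroup v2 root
  std::string key;
  uint64_t value;
  while (stat_file >> key >> value) {  // cpu.stat is one "key value" per line
    if (key == "usage_usec") {
      sample.usage_usec = value;
      break;
    }
  }
  return sample;
}

double utilizationBetween(const CgroupCpuSample& prev, const CgroupCpuSample& cur) {
  const auto wall_usec = std::chrono::duration_cast<std::chrono::microseconds>(
                             cur.taken_at - prev.taken_at)
                             .count();
  const double cpus = std::thread::hardware_concurrency();
  if (wall_usec <= 0 || cpus == 0) {
    return 0.0;
  }
  // A container with a cpu.max quota would divide by the quota instead.
  return static_cast<double>(cur.usage_usec - prev.usage_usec) / (wall_usec * cpus);
}
```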

Contributor

@KBaichoo left a comment

Thanks, lgtm except for the two other pending comments and CI; we also need release notes in changelogs/current.yaml.
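
For reference, release notes live in changelogs/current.yaml; an entry for this change might look roughly like this (wording illustrative, not the PR's final text):

```yaml
# Illustrative changelogs/current.yaml entry; the PR's final wording may differ.
new_features:
- area: overload
  change: |
    Added a CPU utilization resource monitor that the overload manager can use
    to shed load (e.g. reject requests) once system CPU utilization crosses a
    configured threshold. Currently implemented for Linux only.
```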

@KBaichoo
Contributor

KBaichoo commented Jun 26, 2024

> Some other ideas: we use the cgroups cpu controller to limit CPU usage in container environments, so some cgroup metrics, e.g. cpu.stat, could make a valuable resource monitor source for the overload manager.

@labilezhu agreed, another source for the CPU monitor would be a good follow-up.

/wait


This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions bot added the stale label Jul 26, 2024
@KBaichoo removed the stale label Jul 30, 2024
@github-actions bot added the stale label Aug 29, 2024
@github-actions bot removed the stale label Aug 30, 2024
@phlax
Member

phlax commented Sep 3, 2024

hi @cancecen - seems CI is failing

/wait

@markdroth
Contributor

/lgtm api

@repokitteh-read-only bot removed the api label Sep 5, 2024
cancecen and others added 10 commits September 6, 2024 22:44
@phlax
Member

phlax commented Sep 10, 2024

/docs


Docs for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-pr/34713/docs/index.html

The docs are (re-)rendered each time the CI envoy-presubmit (precheck docs) job completes.



phlax
Member

phlax previously approved these changes Sep 10, 2024

@phlax left a comment

nice! thanks

lgtm from docs pov - thanks for iterating

@phlax
Member

phlax commented Sep 10, 2024

@KBaichoo would you mind doing another review?

Contributor

@KBaichoo left a comment

Otherwise, lgtm. Thank you

```cpp
std::string buffer(4, '\0');
cpu_stats_file.read(buffer.data(), 4);
const std::string target = "cpu ";
if (!cpu_stats_file || !std::equal(buffer.begin(), buffer.end(), target.begin(), target.end())) {
```

you can just compare buffer != target
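
Applied to the snippet above, the suggested simplification might look like this (a sketch; the early-return body is hypothetical):

```cpp
// Sketch of the reviewer's suggestion: with buffer sized to match target
// (4 bytes for "cpu "), a plain operator!= replaces the iterator-based
// std::equal call.
std::string buffer(4, '\0');
cpu_stats_file.read(buffer.data(), 4);
const std::string target = "cpu ";
if (!cpu_stats_file || buffer != target) {
  return {};  // hypothetical: report that /proc/stat could not be parsed
}
```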

@kyessenov
Contributor

Somewhat related question: I see that you use kernel-reported CPU usage here, but we use self-reported memory usage in fixed_heap. If we were consistent, we'd probably use cgroup2 /sys/fs/cgroup/memory.current for the memory monitor. Is there some rationale for choosing the kernel here but a user-space monitor there?

@cancecen
Contributor Author

> Somewhat related question: I see that you use kernel-reported CPU usage here, but we use self-reported memory usage in fixed_heap. If we were consistent, we'd probably use cgroup2 /sys/fs/cgroup/memory.current for the memory monitor. Is there some rationale for choosing the kernel here but a user-space monitor there?

Yes. We use the kernel-reported one because we really want the view of the whole system, i.e. there can be compute-intensive workloads running on the same system that Envoy is proxying requests for. This way we can make the shedding decision based on total host utilization.
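
For reference, a sketch of the whole-system view described here: the first line of /proc/stat aggregates jiffies across every CPU and every process on the host, so a monitor built on its deltas sees the co-located workloads too. The field handling below is simplified and assumed, not the PR's exact code:

```cpp
// Hypothetical sketch: whole-system CPU utilization from /proc/stat deltas.
// The aggregate "cpu" line counts jiffies for all CPUs and all processes,
// so the ratio reflects everything running on the host, not just Envoy.
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

struct CpuTimes {
  uint64_t work = 0;   // user + nice + system jiffies
  uint64_t total = 0;  // all fields, including idle and iowait
};

CpuTimes readProcStat() {
  std::ifstream stat_file("/proc/stat");
  std::string line;
  std::getline(stat_file, line);  // the aggregate "cpu" line comes first
  std::istringstream fields(line);
  std::string label;
  fields >> label;  // consume the "cpu" label
  CpuTimes times;
  uint64_t value;
  for (int i = 0; fields >> value; ++i) {
    if (i < 3) {
      times.work += value;  // user, nice, system
    }
    times.total += value;
  }
  return times;
}

// Utilization over an interval is the share of non-idle time in that window.
double utilization(const CpuTimes& prev, const CpuTimes& cur) {
  const uint64_t total_delta = cur.total - prev.total;
  return total_delta == 0 ? 0.0
                          : static_cast<double>(cur.work - prev.work) / total_delta;
}
```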

@kyessenov
Contributor


Is there a reason why you are not using cpu.stat from cgroup2? That would work better on k8s, for example, where Envoy runs inside a container.

@cancecen
Contributor Author


> Is there a reason why you are not using cpu.stat from cgroup2? That would work better on k8s, for example, where Envoy runs inside a container.

That can be a follow-up to this PR. Today my company has three different compute platforms, and the majority of the workloads are in VMs. That's why I structured the code to allow a separate CPU stats reader per platform.
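
A sketch of the per-platform structure described here; class and member names are illustrative, not necessarily the PR's actual ones:

```cpp
// Illustrative shape of a per-platform CPU stats reader split: the monitor
// depends only on the abstract reader, and each compute platform plugs in
// its own snapshot source (/proc/stat on VMs, cgroup cpu.stat in containers).
#include <cstdint>

struct CpuTimes {
  bool is_valid = false;   // false when the underlying stats were unreadable
  uint64_t work_time = 0;  // time spent doing work (non-idle)
  uint64_t total_time = 0; // total time, idle included
};

class CpuStatsReader {
public:
  virtual ~CpuStatsReader() = default;
  virtual CpuTimes getCpuTimes() = 0;
};

class LinuxCpuStatsReader : public CpuStatsReader {
public:
  CpuTimes getCpuTimes() override;  // parses the aggregate line of /proc/stat
};

// A follow-up container-aware reader could implement the same interface:
// class CgroupV2CpuStatsReader : public CpuStatsReader { ... };
```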

@KBaichoo enabled auto-merge (squash) September 11, 2024 18:49
@cancecen
Contributor Author


@kyessenov To add a little more here: we do have a k8s-like container platform within my company, and I use cpu.stat to calculate overall CPU utilization there. I'll upstream that in the next few weeks as I find time. No plans for Windows-based systems as of now, though.

@cancecen
Contributor Author

/retest

@KBaichoo merged commit 4d12162 into envoyproxy:main Sep 11, 2024
38 of 40 checks passed
unicell added a commit to unicell/envoy that referenced this pull request Sep 11, 2024
* upstream/main: (21 commits)
  Add a CPU utilization resource monitor for overload manager (envoyproxy#34713)
  jwks: Add UA string to headers (envoyproxy#35977)
  exceptions: cleaning up macros (envoyproxy#35694)
  coverage: ratcheting (envoyproxy#36058)
  runtime: load rtds bool correctly as true/false instead of 1/0 (envoyproxy#36044)
  Typo in documentation of http original_src filter (envoyproxy#36060)
  docs: updating meeting info (envoyproxy#36052)
  quic: removes more references to spdy::Http2HeaderBlock. (envoyproxy#36057)
  json: add null support to the streamer (envoyproxy#36051)
  json: make the streamer a template class (envoyproxy#36001)
  docs: Add `apt.envoyproxy.io` install information (envoyproxy#36050)
  ext_proc: elide redundant copy in ext_proc filter factory callback (envoyproxy#36015)
  build(deps): bump yarl from 1.11.0 to 1.11.1 in /tools/base (envoyproxy#36049)
  build(deps): bump multidict from 6.0.5 to 6.1.0 in /tools/base (envoyproxy#36048)
  quic: enable certificate compression/decompression (envoyproxy#35999)
  Geoip fix asan failure (envoyproxy#36043)
  mobile: Fix missing logging output in Swift integration tests (envoyproxy#36040)
  http: minor code clean up to the http filter manager (envoyproxy#36027)
  ci/example: Dont build/test the filter example in Envoy CI (envoyproxy#36038)
  ci/codeql: Fix build setup (envoyproxy#36021)
  ...
