Add CPU Entitlement gauge metric & Deprecate CPU Entitlement counter metric #897

mkocher · 2024-01-22T23:53:16Z

Summary

Since the beginning of time, Cloud Foundry has had a CPU metric that represents the percentage of the entire host VM's CPU that a container is using. This number does not reflect the fact that the container shares the host with many other containers.

A while ago the AbsoluteCPUEntitlement & AbsoluteCPUUsage metrics were added. These allowed astute users to calculate the percentage of the entitlement being used. A CPU Entitlement plugin was produced to enable users to see this metric.

Upon seeing this metric, users found it valuable and wanted to see it in more places and do more with it. When evaluating how to expose it in the Cloud Controller API, we realized that while the counter has some advantages, it requires substantial calculations within Cloud Controller and at least two metric envelopes to calculate the delta.
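
To make the cost of the counter approach concrete, here is a minimal sketch of the calculation a consumer has to perform. It assumes that the usage and entitlement values are cumulative counters in the same unit and that the percentage is the ratio of their deltas between two envelopes. The `Sample` type, its field names, and the function name are hypothetical and not part of any Cloud Foundry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One metric envelope, reduced to the two counters (hypothetical shape)."""
    absolute_usage: float        # cumulative CPU time used
    absolute_entitlement: float  # cumulative CPU time the container was entitled to


def entitlement_percentage(previous: Sample, current: Sample) -> float:
    """Percentage of entitlement used between two envelopes.

    Two samples are required because both values are counters: only the
    difference between them describes the interval.
    """
    usage_delta = current.absolute_usage - previous.absolute_usage
    entitlement_delta = current.absolute_entitlement - previous.absolute_entitlement
    if entitlement_delta <= 0:
        raise ValueError("entitlement counter did not advance between samples")
    return 100.0 * usage_delta / entitlement_delta


# Illustrative numbers only, not taken from a real deployment.
earlier = Sample(absolute_usage=1_000.0, absolute_entitlement=4_000.0)
later = Sample(absolute_usage=1_550.0, absolute_entitlement=5_000.0)
print(entitlement_percentage(earlier, later))
```

With a gauge, the consumer reads a single value from a single envelope and needs neither the second sample nor the subtraction.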

Having validated the value and found problems with the approach, we'd like to:

- add a cpu_entitlement percentage metric that would be a gauge, much like the current cpu percentage metric
- add a flag in Diego to enable operators to continue emitting the old metrics for some period of time (default to off)
- (not in Diego) add a flag in Cloud Controller to allow operators to choose between cpu and cpu_entitlement in the container metrics response

Diego repo

Describe alternatives you've considered (optional)

- Utilizing the Logcache PromQL endpoint to do this calculation. However, the PromQL endpoint in Logcache contains a fork of the PromQL code that is unmaintainable.
- Having Cloud Controller or the CF CLI retrieve multiple metric envelopes from Logcache and calculate the rate. This seems like a lot of complexity.

chombium · 2024-01-23T10:28:09Z

@mkocher I consider this change a must and a great enhancement. It doesn't make sense to show the app's CPU usage as a percentage of the CPU available to the whole VM. It would be much better to show how much of the CPU the app is entitled to is being used at the moment.

Here is an example from an app which we have running in one of our foundations in which the differences can be clearly seen:

cf app

cf app cf-app-monitoring
Showing health and status for app cf-app-monitoring in org <reducted> / space <reducted> as <reducted>...

name:              cf-app-monitoring
requested state:   started
routes:            cf-app-monitoring.<reducted>
last uploaded:     Thu 23 Mar 13:47:59 UTC 2023
stack:             cflinuxfs4
buildpacks:
        name                   version   detect output   buildpack name
        staticfile_buildpack   1.6.0     staticfile      staticfile

type:           web
sidecars:
instances:      2/2
memory usage:   64M
     state     since                  cpu    memory         disk         logging            details
#0   running   2024-01-12T17:01:09Z   1.1%   15.6M of 64M   5.2M of 1G   0/s of unlimited
#1   running   2024-01-12T17:43:15Z   1.1%   15.7M of 64M   5.2M of 1G   0/s of unlimited

cf cpu-entitlement

 cf cpu-entitlement cf-app-monitoring
Note: This plugin is experimental.
Showing CPU usage against entitlement for app cf-app-monitoring in org <reducted> / space <reducted> as <reducted>...

     avg usage   curr usage
#0   55.98%      54.78%
#1   58.97%      57.16%

WARNING: Instance #0 was over entitlement from 2024-01-12 17:01:11 to 2024-01-12 17:01:26
WARNING: Instance #1 was over entitlement from 2024-01-12 17:43:23 to 2024-01-12 17:44:23
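
The gap between the two outputs can be put into numbers. As a rough sketch, assuming the `cpu` column is a percentage of one core and that both readings describe roughly the same moment (they were not necessarily taken together), dividing one by the other gives the entitlement implied by the readings. The function name is hypothetical.

```python
def implied_entitlement_percent(cpu_percent: float, entitlement_usage_percent: float) -> float:
    """Entitlement, as a percentage of one core, implied by the two readings."""
    if entitlement_usage_percent <= 0:
        raise ValueError("entitlement usage must be positive")
    return 100.0 * cpu_percent / entitlement_usage_percent


# cpu from `cf app` and curr usage from `cf cpu-entitlement`, per instance.
for index, (cpu, usage) in enumerate([(1.1, 54.78), (1.1, 57.16)]):
    print(f"#{index}: ~{implied_entitlement_percent(cpu, usage):.1f}% of a core")
```

Under these assumptions each instance is entitled to roughly 2% of a core, which is why a reading of 1.1% in `cf app` already corresponds to more than half of the entitlement.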

We should be careful when rolling out this change, as it would be a breaking change if we stop emitting the current metric by default. We should be loud when announcing this and provide ops files in cf-deployment for activating and switching the configuration.

mkocher · 2024-01-23T18:04:41Z

👍 glad to hear you're in favor

Agreed, we need to make this backwards compatible, though I'd prefer to turn off the old metrics by default sooner rather than later. I don't think many people look at them, and container metrics generate a ton of individual time series, which can put a burden on some metric stores.

mkocher · 2024-01-23T18:05:55Z

We checked App Autoscaler Release and searched for absolute_entitlement and absolute_usage and got no results. So we think this is safe from that perspective.

PlamenDoychev · 2024-01-24T06:38:32Z

Dear @mkocher, @chombium, as far as I remember, the 'cf app' metric shows the percentage of a single CPU core that the container is currently using, not the percentage of the entire host VM's CPU.

I.e. if we take @chombium's example above, the app is currently consuming 1.1% of a single host CPU.

On our CF deployments we allow CPU bursting. When an application uses more CPU, we have seen this metric spike to several hundred percent. At 300%, for example, the container is consuming 3 CPUs out of all those available on the host. In general, the maximum value this metric can reach is (100*N)%, where N is the number of CPU cores the host VM has. This makes the metric an easy way to see whether the application is currently bursting when debugging.
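
The scale described here can be written down directly. A small sketch, with hypothetical function names, of the ceiling and of converting a reading into cores' worth of CPU:

```python
def max_cpu_percent(host_cores: int) -> int:
    """Upper bound of the cpu metric on a host with the given number of cores."""
    if host_cores < 1:
        raise ValueError("a host has at least one core")
    return 100 * host_cores


def cores_in_use(cpu_percent: float) -> float:
    """Number of cores' worth of CPU that a cpu reading corresponds to."""
    return cpu_percent / 100.0


print(max_cpu_percent(8))   # ceiling on a hypothetical 8-core host
print(cores_in_use(300.0))  # the 300% example from the comment above
```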

The CPUEntitlement metric is a really good one, but it has different semantics: it shows where the container's average/current CPU consumption stands relative to what it is entitled to. Also keep in mind that the first metric comes for free, while for 'cf cpu-entitlement' you need to install a cf CLI plugin.

mkocher · 2024-01-24T19:34:05Z

Yep, the current cpu metric is out of 100*NumberOfCores, not 100. I'm not sure why we do that as an industry, but it is the convention. Apps however aren't allocated cores, they're allocated shares. So using more than 100% doesn't indicate one way or the other if the app is bursting.
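
Put differently, whether an app is bursting depends on its usage relative to its entitlement, not on crossing 100%. A minimal sketch with made-up numbers: the entitlement values and the function name are assumptions for illustration, and real entitlements derive from shares rather than being configured as a percentage.

```python
def is_over_entitlement(cpu_percent: float, entitlement_percent: float) -> bool:
    """True when usage exceeds entitlement; both are percentages of one core."""
    return cpu_percent > entitlement_percent


# Hypothetical large app: 300% usage, but entitled to 400% (four cores' worth).
print(is_over_entitlement(300.0, 400.0))  # False
# Hypothetical small app: only 3% usage, but entitled to 2%.
print(is_over_entitlement(3.0, 2.0))      # True
```

The first app is well above 100% yet within its entitlement, while the second is far below 100% and still bursting.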

winkingturtle-vmw · 2024-01-29T18:57:29Z

@mkocher CI is failing with rep-spec windows diversion. Should this be applied to rep-windows too?

mkocher · 2024-01-29T22:04:07Z

Oops. It has been 0️⃣ days since we forgot about windows.

As far as we can tell this should be applied verbatim to Windows as well. We'll take a look.

mkocher · 2024-01-29T23:40:05Z

#901 fixes the Windows issue. It also makes Diego releasable again, as it makes the change non-breaking.

winkingturtle-vmw · 2024-02-09T13:21:25Z

Released in https://github.com/cloudfoundry/diego-release/releases/tag/v2.93.0

TimGerlach · 2024-02-22T16:10:05Z

Will the official container metrics documentation still be updated?

mkocher added the enhancement label Jan 22, 2024

This was referenced Jan 25, 2024

Add cpu_entitlement metric, allow filtering cloudfoundry/diego-logging-client#109

Merged

Emit CPU Entitlement Percentage Metric cloudfoundry/executor#92

Merged

Add cpu_entitlement metric #900

Merged

acrmp mentioned this issue Jan 29, 2024

Exclude cpu entitlement metric by default #901

Merged

3 tasks

acrmp mentioned this issue Jan 31, 2024

Emit easier to consume cpu_entitlement metric cloudfoundry/cf-deployment#1164

Merged

10 tasks

winkingturtle-vmw closed this as completed Feb 9, 2024

acrmp mentioned this issue Feb 15, 2024

Expose process CPU Entitlement in stats cloudfoundry/cloud_controller_ng#3641

Merged

5 tasks

acrmp mentioned this issue Mar 15, 2024

Output CPU Entitlement metrics rather than the old CPU metric cloudfoundry/cli#2812

Open
