Add CPU Entitlement gauge metric & Deprecate CPU Entitlement counter metric #897

Closed
mkocher opened this issue Jan 22, 2024 · 10 comments

@mkocher
Member

mkocher commented Jan 22, 2024

Summary

Since the beginning of time, Cloud Foundry has had a CPU metric that represents the percentage of the entire host VM's CPU that a container is using. This number does not reflect the fact that the container shares the host with a bunch of other containers.

A while ago, the AbsoluteCPUEntitlement and AbsoluteCPUUsage metrics were added. These allowed astute users to calculate the percentage of their entitlement being used. A CPU Entitlement plugin was produced so that users could see this metric.

Upon seeing this metric, users found it valuable and wanted to see it in more places and do more with it. When evaluating how to expose it in the Cloud Controller API, we realized that while the counter has some advantages, it requires substantial calculations within Cloud Controller and at least two metric envelopes to calculate the delta.
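To illustrate why the counter approach needs at least two envelopes, here is a minimal sketch of the delta calculation. It assumes both counters are cumulative and expressed in the same unit of CPU time; the `CounterSample` type, its field names (modeled on the `absolute_usage` and `absolute_entitlement` metric names), and the sample values are hypothetical and are not taken from Diego or Cloud Controller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """Hypothetical reading of the two cumulative counters for one instance."""

    absolute_usage: int        # CPU time consumed so far (assumed cumulative)
    absolute_entitlement: int  # CPU time entitled so far (assumed cumulative)


def entitlement_percentage(earlier: CounterSample, later: CounterSample) -> float:
    """Return the percentage of entitled CPU used between two samples."""
    usage_delta = later.absolute_usage - earlier.absolute_usage
    entitlement_delta = later.absolute_entitlement - earlier.absolute_entitlement
    if entitlement_delta <= 0:
        # A single envelope, or samples in the wrong order, cannot yield a rate.
        raise ValueError("need two ordered samples spanning a non-zero interval")
    return 100.0 * usage_delta / entitlement_delta


# Made-up values for illustration only.
earlier = CounterSample(absolute_usage=1_000_000_000, absolute_entitlement=4_000_000_000)
later = CounterSample(absolute_usage=1_550_000_000, absolute_entitlement=5_000_000_000)
print(f"{entitlement_percentage(earlier, later):.2f}%")  # prints 55.00%
```

A gauge would move this arithmetic to the emitter, so a consumer could read a single envelope instead of retaining the previous one.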

Having validated the value and found problems with the approach, we'd like to:

  • add a cpu_entitlement percentage metric that would be a gauge much like the current cpu percentage metric
  • add a flag in Diego to enable operators to continue emitting the old metrics for some period of time (default to off)
  • (not in Diego) add a flag in Cloud Controller to allow operators to choose between cpu and cpu_entitlement in the container metrics response

Diego repo

Describe alternatives you've considered (optional)

  • Utilizing the Logcache promql endpoint to do this calculation. However, the promql endpoint in Logcache contains a fork of the promql code that is unmaintainable.
  • Having Cloud Controller or the CF CLI retrieve multiple metric envelopes from Logcache and calculate the rate. This seems like a lot of complexity.
@chombium

chombium commented Jan 23, 2024

@mkocher I see this change as a must and a great enhancement. It doesn't make sense to show the app's CPU usage as a percentage of the CPU available to the whole VM. It would be much better to show how much of the CPU the app is entitled to is being used at the moment.

Here is an example from an app which we have running in one of our foundations in which the differences can be clearly seen:

cf app

cf app cf-app-monitoring
Showing health and status for app cf-app-monitoring in org <redacted> / space <redacted> as <redacted>...

name:              cf-app-monitoring
requested state:   started
routes:            cf-app-monitoring.<redacted>
last uploaded:     Thu 23 Mar 13:47:59 UTC 2023
stack:             cflinuxfs4
buildpacks:
        name                   version   detect output   buildpack name
        staticfile_buildpack   1.6.0     staticfile      staticfile

type:           web
sidecars:
instances:      2/2
memory usage:   64M
     state     since                  cpu    memory         disk         logging            details
#0   running   2024-01-12T17:01:09Z   1.1%   15.6M of 64M   5.2M of 1G   0/s of unlimited
#1   running   2024-01-12T17:43:15Z   1.1%   15.7M of 64M   5.2M of 1G   0/s of unlimited

cf cpu-entitlement

 cf cpu-entitlement cf-app-monitoring
Note: This plugin is experimental.
Showing CPU usage against entitlement for app cf-app-monitoring in org <redacted> / space <redacted> as <redacted>...

     avg usage   curr usage
#0   55.98%      54.78%
#1   58.97%      57.16%

WARNING: Instance #0 was over entitlement from 2024-01-12 17:01:11 to 2024-01-12 17:01:26
WARNING: Instance #1 was over entitlement from 2024-01-12 17:43:23 to 2024-01-12 17:44:23

We should be careful about this change when rolling it out as this would be a breaking change if we stop emitting the current metric by default. We should be loud when announcing this and provide ops files in cf-deployment for activating and switching configuration.

@mkocher
Member Author

mkocher commented Jan 23, 2024

👍 glad to hear you're in favor

Agreed, we need to make this backwards compatible, though I'd prefer to turn off the old metrics by default sooner rather than later. I don't think many people look at them, and container metrics generate a ton of individual time series, which can put a burden on some metric stores.

@mkocher
Member Author

mkocher commented Jan 23, 2024

We checked App Autoscaler Release and searched for absolute_entitlement and absolute_usage and got no results. So we think this is safe from that perspective.

@PlamenDoychev

PlamenDoychev commented Jan 24, 2024

Dear @mkocher, @chombium, as far as I remember, the 'cf app' metric shows the percentage the container is currently using of a single CPU core, not of the entire host VM's CPU.

I.e., if we take @chombium's example above, the app is currently consuming 1.1% of a single host CPU.

On our CF deployments we allow CPU bursting. When an application uses more CPU, we have seen this metric spike to several hundred percent. At 300%, for example, the container is consuming 3 of the CPUs available on the host. In general, the maximum value this metric can produce is (100*N)%, where N is the number of CPU cores the host VM has. When debugging, this metric is an easy way to see whether the application is currently bursting.
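The scale described above can be sketched in a few lines. This only illustrates the arithmetic; the function names and the 8-core host are hypothetical and do not come from Diego or the cf CLI.

```python
def max_cpu_percent(host_cores: int) -> float:
    """Upper bound of the per-core CPU metric: (100 * N)% for N host cores."""
    if host_cores < 1:
        raise ValueError("host_cores must be at least 1")
    return 100.0 * host_cores


def cores_in_use(cpu_percent: float) -> float:
    """Number of whole-core equivalents a reading corresponds to."""
    return cpu_percent / 100.0


def share_of_host(cpu_percent: float, host_cores: int) -> float:
    """The same reading rescaled to a percentage of the entire host."""
    return 100.0 * cpu_percent / max_cpu_percent(host_cores)


# Hypothetical example: a 300% reading on an 8-core host VM.
print(max_cpu_percent(8))       # 800.0
print(cores_in_use(300.0))      # 3.0
print(share_of_host(300.0, 8))  # 37.5
```

None of these values says how much CPU the container is entitled to, which is the gap the entitlement metric is meant to fill.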

The CPUEntitlement metric is really a good one, but it has different semantics: it shows where the container's average and current CPU consumption stand relative to what it is entitled to. Also keep in mind that the first metric comes for free, while for 'cf cpu-entitlement' you need to install a cf CLI plugin.

@mkocher
Member Author

mkocher commented Jan 24, 2024

Yep, the current cpu metric is out of 100*NumberOfCores, not 100. I'm not sure why we do that as an industry, but it is the convention. Apps, however, aren't allocated cores; they're allocated shares. So using more than 100% doesn't indicate one way or the other whether the app is bursting.

@winkingturtle-vmw
Contributor

@mkocher CI is failing with rep-spec windows diversion. Should this be applied to rep-windows too?

@mkocher
Member Author

mkocher commented Jan 29, 2024

Oops. It has been 0️⃣ days since we forgot about windows.

As far as we can tell this should be applied verbatim to Windows as well. We'll take a look.

@mkocher
Member Author

mkocher commented Jan 29, 2024

#901 fixes the Windows issue. It also makes Diego releasable again, as it makes the change non-breaking.

@winkingturtle-vmw
Contributor

@TimGerlach

Will the official container metrics documentation still be updated?
