Align Observability Infrastructure Host CPU usage calculation with dashboard [Metrics System] Host Overview CPU Used calculation #182335

jeanfabrice · 2024-05-02T09:02:24Z

Describe the feature:

In Observability Infrastructure Host, CPU usage is calculated as:

(average(system.cpu.user.pct) + average(system.cpu.system.pct)) / max(system.cpu.cores)

whereas in the "[Metrics System] Host Overview" dashboard, the "CPU Used" indicator is directly using system.cpu.total.norm.pct
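To make the discrepancy concrete, here is a minimal sketch that evaluates both definitions on a single document. The sample values and the helper names `hosts_ui_cpu_usage` and `dashboard_cpu_used` are hypothetical, chosen only for illustration. For a single document the `average` and `max` aggregations are no-ops, so they are omitted.

```python
# Hypothetical Metricbeat-style document for one 4-core host.
# Non-normalized *.pct fields range from 0 to the number of cores.
sample = {
    "system.cpu.cores": 4,
    "system.cpu.user.pct": 1.2,
    "system.cpu.system.pct": 0.4,
    "system.cpu.total.norm.pct": 0.5,  # assumed value; also covers nice, irq, softirq, steal
}


def hosts_ui_cpu_usage(doc):
    """Formula quoted in this issue for Observability Infrastructure Host."""
    busy = doc["system.cpu.user.pct"] + doc["system.cpu.system.pct"]
    return busy / doc["system.cpu.cores"]


def dashboard_cpu_used(doc):
    """The dashboard reads the pre-normalized total directly."""
    return doc["system.cpu.total.norm.pct"]


hosts_value = hosts_ui_cpu_usage(sample)
dashboard_value = dashboard_cpu_used(sample)
print(f"Hosts UI: {hosts_value:.1%}, dashboard: {dashboard_value:.1%}")
```

With these made-up numbers the two views report roughly 40% and 50% for the same host, because the first formula ignores every state other than user and system.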

Describe a specific use case for the feature:
Make dashboards coherent with each other.
Make sure indicators representing the same thing use the same metrics to calculate the CPU Usage.

If we decide to go with the average of CPU time (Observability Infrastructure Host method), we should probably include system.cpu.nice.pct, system.cpu.irq.pct, system.cpu.softirq.pct and system.cpu.iowait.pct as well

If we decide to go with system.cpu.total.norm.pct, what about iowait CPU time, as system.cpu.total.norm.pct seems to exclude it:

The percentage of CPU time spent in states other than Idle and IOWait.


elasticmachine · 2024-05-02T09:16:20Z

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

felixbarny · 2024-05-03T12:36:42Z

If we decide to go with system.cpu.total.norm.pct, what about iowait CPU time, as system.cpu.total.norm.pct seems to exclude it:

IO wait is also an idle state. So, if you want to calculate the CPU usage, you sum up all states except for the idle states idle and iowait.
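As a rough illustration of that rule, the sketch below sums every state except idle and iowait, and checks that this equals one minus the two idle states. The state names mirror the `system.cpu.*` fields mentioned in this issue, but the values are hypothetical normalized fractions that add up to 1.0.

```python
# Hypothetical normalized CPU time fractions for one host (they sum to 1.0).
states = {
    "user": 0.30,
    "system": 0.10,
    "nice": 0.05,
    "irq": 0.0,
    "softirq": 0.025,
    "steal": 0.025,
    "iowait": 0.075,
    "idle": 0.425,
}

# Both idle and iowait are treated as idle time.
IDLE_STATES = {"idle", "iowait"}

# CPU usage as the sum of all non-idle states.
busy = sum(value for name, value in states.items() if name not in IDLE_STATES)

# Equivalent view: everything that is not idle time.
busy_alt = 1.0 - sum(states[name] for name in IDLE_STATES)

print(f"busy = {busy:.3f}, 1 - idle - iowait = {busy_alt:.3f}")
```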

roshan-elastic · 2024-05-07T11:21:40Z

Thank you for this issue.

At the moment, we don't have short-term plans to change this calculation, as we have dependencies throughout the UI on the current formula (e.g. related inventory alerting rules for 'CPU Usage') that we would need to manage.

However, in the medium term, we do plan to solve for this by having some kind of user-configurable 'metric library' which would enable us to support different metrics. With something like this, we could preserve the existing behaviour for those who rely on it (e.g. for alerts) while still being able to promote newer metrics with better definitions.

We're not sure on timelines at the moment but we're aiming to solve for this in the next 6-12 months.

andrewvc · 2024-06-06T23:09:49Z

One thing that jumps out at me here is that our language around "CPU Usage" is just imprecise. There are multiple ways to calculate CPU usage; see this reference for details. For instance, it's not obvious to me that we should factor in steal in an alerting scenario, or if we should factor nice in for that matter.

@felixbarny you're right that just excluding wait states indicates how much of the CPU is busy, but what if the user is asking a different question, basically, how much of the app am "I" (speaking loosely here, since system is shared) using, where system + user is maybe a better fit?

I'd propose that, rather than redefine the CPU Usage alert, we give it a default behavior, which could be its current behavior, and add an additional control that lets you change the way it's calculated to whichever of the CPU fields the user thinks is most useful. Perhaps this should be hidden a little in the UI as an advanced setting.

Curious to hear what others think!

felixbarny · 2024-06-07T07:05:35Z

@felixbarny you're right that just excluding wait states indicates how much of the CPU is busy, but what if the user is asking a different question, basically, how much of the app am "I" (speaking loosely here, since system is shared) using, where system + user is maybe a better fit?

Fair point, but I think that mostly applies to when looking at the CPU usage of a process. If you're looking at the CPU utilization for the whole host (such as what the host UI is doing), I'm not sure if it makes sense to exclude certain cpu states. I think a typical user would expect the CPU usage to range from 0-100%, and would want to be alerted when the usage is above a certain threshold. If the host experiences a lot of steal time, that takes away resources for my applications, and I'd like to be alerted when the total utilization reaches a certain threshold.

Aside from the question about the different CPU states, another issue with the formula is that the division by max(system.cpu.cores), which is supposed to normalize the utilization to the range of 0-1, isn't quite correct. When looking at multiple hosts at once, it uses the number of cores of the host with the most CPU cores. So the normalization only works when all hosts have the same number of cores. I think we should use the *.norm.pct equivalents that are already normalized.
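A small worked example of the normalization problem, using two hypothetical hosts with different core counts. The host names and numbers are invented, and the short keys (`cores`, `user_pct`, `system_pct`) stand in for `system.cpu.cores`, `system.cpu.user.pct` and `system.cpu.system.pct`.

```python
from statistics import mean

# Two hypothetical hosts: a saturated 2-core box and a mostly idle 8-core box.
hosts = [
    {"name": "small", "cores": 2, "user_pct": 1.6, "system_pct": 0.4},
    {"name": "large", "cores": 8, "user_pct": 0.8, "system_pct": 0.0},
]

# Formula from the issue, applied across both hosts at once:
# (average(user) + average(system)) / max(cores)
current_formula = (
    mean(h["user_pct"] for h in hosts) + mean(h["system_pct"] for h in hosts)
) / max(h["cores"] for h in hosts)

# Normalizing per host first, which is what averaging *.norm.pct fields would amount to.
per_host_norm = [(h["user_pct"] + h["system_pct"]) / h["cores"] for h in hosts]
normalized_first = mean(per_host_norm)

print(f"current formula:  {current_formula:.1%}")
print(f"normalized first: {normalized_first:.1%}")
```

Here the small host is fully busy (100%) and the large one is at 10%, so the per-host average is 55%, while dividing by the largest core count yields 17.5% and hides the saturated host.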

botelastic bot added the needs-team Issues missing a team label label May 2, 2024

jeanfabrice added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label May 2, 2024

botelastic bot removed the needs-team Issues missing a team label label May 2, 2024

smith added bug Fixes for quality problems that affect the customer experience Feature:ObsHosts Hosts feature within Observability needs-refinement Needs PM's to refine scope labels May 2, 2024
