Align Observability Infrastructure Host CPU usage calculation with dashboard [Metrics System] Host Overview CPU Used calculation #182335

Open
jeanfabrice opened this issue May 2, 2024 · 5 comments
Labels
bug (Fixes for quality problems that affect the customer experience)
Feature:ObsHosts (Hosts feature within Observability)
needs-refinement (Needs PM's to refine scope)
Team:obs-ux-infra_services (Observability Infrastructure & Services User Experience Team)

Comments

@jeanfabrice

Describe the feature:

In Observability Infrastructure Host, CPU usage is calculated as:

(average(system.cpu.user.pct) + average(system.cpu.system.pct)) / max(system.cpu.cores)

whereas in the "[Metrics System] Host Overview" dashboard, the "CPU Used" indicator is directly using system.cpu.total.norm.pct
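To make the difference concrete, here is a minimal sketch of the two calculations side by side. The sample values are made up for a hypothetical 4-core host, the function names are illustrative only, and averaging system.cpu.total.norm.pct over the samples is an assumption about how the dashboard aggregates the field.

```python
from statistics import mean


def hosts_view_cpu_usage(user_pct, system_pct, cores):
    """(average(user.pct) + average(system.pct)) / max(cores), as quoted above."""
    return (mean(user_pct) + mean(system_pct)) / max(cores)


def dashboard_cpu_used(total_norm_pct):
    """Average of system.cpu.total.norm.pct (the aggregation is an assumption)."""
    return mean(total_norm_pct)


# Hypothetical samples from one 4-core host. The non-normalized *.pct fields
# can exceed 1.0 on multi-core machines, so 1.2 means 1.2 cores' worth of time.
user_pct = [1.2, 1.2]
system_pct = [0.4, 0.4]
cores = [4, 4]
# Hypothetical normalized total; it also counts softirq, steal and similar states.
total_norm_pct = [0.5, 0.5]

hosts_value = hosts_view_cpu_usage(user_pct, system_pct, cores)
dashboard_value = dashboard_cpu_used(total_norm_pct)
print(f"Hosts view: {hosts_value:.2f}")
print(f"Dashboard:  {dashboard_value:.2f}")
```

With these made-up numbers the Hosts view reports roughly 40% while the dashboard reports 50%, because the time spent in states other than user and system only shows up in the second value.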

Describe a specific use case for the feature:
Make dashboards coherent with each other.
Make sure indicators representing the same thing use the same metrics to calculate the CPU Usage.

If we decide to go with the average of CPU time (the Observability Infrastructure Host method), we should probably include system.cpu.nice.pct, system.cpu.irq.pct, system.cpu.softirq.pct and system.cpu.iowait.pct as well.

If we decide to go with system.cpu.total.norm.pct, what about iowait CPU time, as system.cpu.total.norm.pct seems to exclude it:

The percentage of CPU time spent in states other than Idle and IOWait.

@botelastic botelastic bot added the needs-team Issues missing a team label label May 2, 2024
@jeanfabrice jeanfabrice added the Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team label May 2, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@botelastic botelastic bot removed the needs-team Issues missing a team label label May 2, 2024
@smith smith added bug Fixes for quality problems that affect the customer experience Feature:ObsHosts Hosts feature within Observability needs-refinement Needs PM's to refine scope labels May 2, 2024
@felixbarny
Member

If we decide to go with system.cpu.total.norm.pct, what about iowait CPU time, as system.cpu.total.norm.pct seems to exclude it:

IO wait is also an idle state. So, if you want to calculate the CPU usage, you sum up all states except for the idle states idle and iowait.
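As a rough illustration of that rule, the sketch below sums every state except idle and iowait. It assumes the per-state values are already normalized fractions that add up to 1.0 for the host; the sample breakdown and the helper name are hypothetical.

```python
# States treated as "not busy", following the comment above.
IDLE_STATES = {"idle", "iowait"}


def cpu_busy_fraction(norm_pct_by_state):
    """Sum all CPU states except the idle ones (idle and iowait)."""
    return sum(
        value
        for state, value in norm_pct_by_state.items()
        if state not in IDLE_STATES
    )


# Hypothetical normalized breakdown for one host (fractions sum to 1.0).
sample = {
    "user": 0.30,
    "system": 0.10,
    "nice": 0.0,
    "irq": 0.0,
    "softirq": 0.025,
    "steal": 0.075,
    "idle": 0.45,
    "iowait": 0.05,
}

busy = cpu_busy_fraction(sample)
user_plus_system = sample["user"] + sample["system"]
print(f"busy (all non-idle states): {busy:.3f}")
print(f"user + system only:         {user_plus_system:.3f}")
```

In this made-up breakdown the busy fraction is about 0.5, while user plus system alone is about 0.4; the gap is the softirq and steal time.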

@roshan-elastic

Thank you for this issue.

At the moment, we don't have short-term plans to change this calculation, because other parts of the UI depend on the current formula (e.g. the related inventory alerting rules for 'CPU Usage') and we need to manage those dependencies.

However, in the medium term, we do plan to address this with some kind of user-configurable 'metric library' that would let us support different metrics. With something like this, we could preserve the existing behaviour for those who rely on it (e.g. for alerts) while promoting newer metrics with better definitions.

We're not sure about timelines at the moment, but we're aiming to solve this in the next 6-12 months.

@andrewvc
Contributor

andrewvc commented Jun 6, 2024

One thing that jumps out at me here is that our language around "CPU Usage" is just imprecise. There are multiple ways to calculate CPU usage; see this reference for details. For instance, it's not obvious to me that we should factor in steal in an alerting scenario, or if we should factor nice in for that matter.

@felixbarny you're right that just excluding wait states indicates how much of the CPU is busy, but what if the user is asking a different question, basically, how much of the app am "I" (speaking loosely here, since system is shared) using, where system + user is maybe a better fit?

I'd propose that, rather than redefining the CPU Usage alert, we give it a default behavior, which could be its current behavior, and add a control that lets the user change how it's calculated to whichever of the CPU fields they find most useful. Perhaps this should be tucked away in the UI as an advanced setting.

Curious to hear what others think!
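One way to picture the proposal is a small table of named definitions with the current behavior as the default. Everything in this sketch is hypothetical: the definition names, the `cpu_usage` helper and the sample values are illustrative and do not correspond to any existing Kibana setting or API. It also assumes normalized per-state fractions as input.

```python
# Hypothetical named definitions of "CPU usage"; each lists the states it sums.
CPU_USAGE_DEFINITIONS = {
    "user_system": ("user", "system"),
    "all_non_idle": ("user", "system", "nice", "irq", "softirq", "steal"),
}

# The default keeps today's user + system behavior so existing alerts are unaffected.
DEFAULT_DEFINITION = "user_system"


def cpu_usage(norm_pct_by_state, definition=DEFAULT_DEFINITION):
    """Sum the normalized states selected by the chosen definition."""
    states = CPU_USAGE_DEFINITIONS[definition]
    return sum(norm_pct_by_state.get(state, 0.0) for state in states)


# Hypothetical normalized sample for one host.
sample = {"user": 0.25, "system": 0.25, "steal": 0.25, "idle": 0.25}

default_value = cpu_usage(sample)
advanced_value = cpu_usage(sample, definition="all_non_idle")
print(default_value, advanced_value)
```

With this sample, the default definition yields 0.5 and the opt-in definition yields 0.75, so an alert threshold of 70% would only fire under the second one.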

@felixbarny
Member

@felixbarny you're right that just excluding wait states indicates how much of the CPU is busy, but what if the user is asking a different question, basically, how much of the app am "I" (speaking loosely here, since system is shared) using, where system + user is maybe a better fit?

Fair point, but I think that mostly applies when looking at the CPU usage of a process. If you're looking at the CPU utilization for the whole host (which is what the host UI does), I'm not sure it makes sense to exclude certain CPU states. I think a typical user would expect the CPU usage to range from 0-100%, and would want to be alerted when the usage is above a certain threshold. If the host experiences a lot of steal time, that takes resources away from my applications, and I'd like to be alerted when the total utilization reaches a certain threshold.

Aside from the question about the different CPU states, another issue with the formula is that the division by max(system.cpu.cores), which is supposed to normalize the utilization to the range 0-1, isn't quite correct. When looking at multiple hosts at once, it uses the core count of the host with the most CPU cores, so the normalization only works when all hosts have the same number of cores. I think we should use the *.norm.pct equivalents, which are already normalized.
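A small worked example of the normalization problem, using two hypothetical hosts with different core counts. The host data and function names are made up for illustration; the first function follows the formula quoted at the top of the issue, and the second normalizes each host by its own core count before averaging, which is what the *.norm.pct fields would give.

```python
from statistics import mean

# Hypothetical hosts: "small" is fully busy (2.0 of 2 cores),
# "large" is mostly idle (2.0 of 8 cores).
hosts = [
    {"name": "small", "cores": 2, "user_pct": 1.5, "system_pct": 0.5},
    {"name": "large", "cores": 8, "user_pct": 1.5, "system_pct": 0.5},
]


def usage_divided_by_max_cores(hosts):
    """Current formula: average the raw values, divide by the largest core count."""
    avg_user = mean([h["user_pct"] for h in hosts])
    avg_system = mean([h["system_pct"] for h in hosts])
    return (avg_user + avg_system) / max(h["cores"] for h in hosts)


def usage_normalized_per_host(hosts):
    """Normalize each host by its own core count, then average."""
    return mean([(h["user_pct"] + h["system_pct"]) / h["cores"] for h in hosts])


by_max_cores = usage_divided_by_max_cores(hosts)
per_host = usage_normalized_per_host(hosts)
print(f"divide by max cores: {by_max_cores}")
print(f"normalize per host:  {per_host}")
```

Here the current formula gives 0.25 for the pair, even though one host is saturated, while per-host normalization gives 0.625 (the mean of 1.0 and 0.25).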

6 participants