Align Observability Infrastructure Host CPU usage calculation with dashboard [Metrics System] Host Overview CPU Used calculation #182335
Comments
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
IO wait is also an idle state. So, if you want to calculate the CPU usage, you sum up all states except for the idle states.
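To make the "sum everything except the idle states" idea concrete, here is a minimal sketch. The state names and the `cpu_usage` helper are illustrative assumptions, not the actual Kibana or Metricbeat implementation; the input is assumed to be per-state CPU time over one sampling interval.

```python
from typing import Mapping

# Assumption for this sketch: both "idle" and "iowait" count as idle states,
# as suggested in the comment above.
IDLE_STATES = frozenset({"idle", "iowait"})


def cpu_usage(state_time: Mapping[str, float]) -> float:
    """Return the busy fraction (0.0 to 1.0) of total CPU time.

    `state_time` maps a CPU state name to the time spent in that state
    during the sampling interval (any consistent unit, e.g. ticks).
    """
    total = sum(state_time.values())
    if total == 0:
        # No time recorded in the interval; report zero usage.
        return 0.0
    busy = sum(t for state, t in state_time.items() if state not in IDLE_STATES)
    return busy / total


# Hypothetical sample: 100 ticks in total, 60 of them idle or waiting on IO.
sample = {
    "user": 30, "system": 10, "nice": 0, "irq": 0,
    "softirq": 0, "steal": 0, "idle": 50, "iowait": 10,
}
usage = cpu_usage(sample)
```

Because every state is included in the denominator, the result always stays between 0 and 1 regardless of how many states the platform reports.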
Thank you for this issue. At the moment, we don't have short-term plans to change this calculation, as we have dependencies throughout the UI on the current formula (e.g. the related inventory alerting rules for 'CPU Usage') that we need to manage. However, in the medium term, we do plan to solve this with some kind of user-configurable 'metric library' that would enable us to support different metrics. With something like this, we could preserve the existing behaviour for those who rely on it (e.g. for alerts) while promoting newer metrics with better definitions. We're not sure of timelines at the moment, but we're aiming to solve this in the next 6-12 months.
One thing that jumps out at me here is that our language around "CPU Usage" is just imprecise. There are multiple ways to calculate CPU usage; see this reference for details. For instance, it's not obvious to me that we should factor in @felixbarny you're right that just excluding wait states indicates how much of the CPU is busy, but what if the user is asking a different question, basically, how much of the app am "I" (speaking loosely here, since the system is shared) using, where system + user is maybe a better fit? I'd propose that, rather than redefine the CPU Usage alert, we give it a default behavior, which could be its current behavior, and add an additional control that lets you change the way it's calculated to whichever of the CPU fields the user thinks is most useful. Perhaps this should be hidden a little in the UI as an advanced setting. Curious to hear what others think!
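A rough sketch of what such a selectable calculation could look like. The formula names (`busy`, `user_system`), the `FORMULAS` table, and the `cpu_usage_by` helper are all hypothetical and only illustrate the proposal; which formula should be the default is left as a parameter, since that is the open question here.

```python
from typing import Callable, Dict, Mapping

# Each formula takes per-state CPU fractions (0.0 to 1.0) and returns a usage value.
Formula = Callable[[Mapping[str, float]], float]

# Hypothetical formula names, chosen for this sketch only.
FORMULAS: Dict[str, Formula] = {
    # Everything that is not idle or waiting on IO.
    "busy": lambda s: 1.0 - s.get("idle", 0.0) - s.get("iowait", 0.0),
    # Only time spent in user and system states.
    "user_system": lambda s: s.get("user", 0.0) + s.get("system", 0.0),
}


def cpu_usage_by(states: Mapping[str, float], formula: str = "busy") -> float:
    """Compute CPU usage with the selected formula.

    Raises ValueError for an unknown formula name so that a misconfigured
    rule fails loudly instead of silently using a different definition.
    """
    try:
        compute = FORMULAS[formula]
    except KeyError:
        raise ValueError(f"unknown CPU usage formula: {formula!r}") from None
    return compute(states)


# Hypothetical sample where the two definitions disagree because of "nice" time.
states = {"user": 0.25, "system": 0.10, "nice": 0.05, "idle": 0.50, "iowait": 0.10}
busy = cpu_usage_by(states, "busy")
user_system = cpu_usage_by(states, "user_system")
```

Existing alert rules would keep passing whatever name maps to today's behavior, while new rules could opt in to another definition.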
Fair point, but I think that mostly applies when looking at the CPU usage of a process. If you're looking at the CPU utilization of the whole host (which is what the host UI is doing), I'm not sure it makes sense to exclude certain CPU states. I think a typical user would expect the CPU usage to range from 0-100%, and would want to be alerted when the usage is above a certain threshold. If the host experiences a lot of Aside from the question about the different CPU states, another issue with the formula is that the division by
Describe the feature:
In Observability Infrastructure Host, CPU usage is calculated as:
whereas in the "[Metrics System] Host Overview" dashboard, the "CPU Used" indicator is directly using system.cpu.total.norm.pct
Describe a specific use case for the feature:
Make dashboards coherent with each other.
Make sure indicators representing the same thing use the same metrics to calculate the CPU Usage.
If we decide to go with the average of CPU time (Observability Infrastructure Host method), we should probably include system.cpu.nice.pct, system.cpu.irq.pct, system.cpu.softirq.pct and system.cpu.iowait.pct as well.
If we decide to go with system.cpu.total.norm.pct, what about iowait CPU time, as system.cpu.total.norm.pct seems to exclude it?