[Linux] Provide access to Pressure Stall Information metrics #1932

Open
MrPippin66 opened this issue Apr 10, 2021 · 4 comments

Comments

@MrPippin66

OS: Linux (kernel 4.20 or higher, unless the vendor has backported the feature)
Type: Performance metrics for CPU, Memory & IO

Summary:

https://www.kernel.org/doc/html/latest/accounting/psi.html

Though this is a relatively new feature, it will become a common source of information for diagnosing performance issues on systems.

I would request that this information be made available via psutil (when PSI and/or cgroup support is enabled in the OS), primarily so that tools built atop this framework can readily use it for monitoring.

Ultimately, I'd suggest a new category (psi) from which to gather these values:

psutil.psi_cpu()

psutil.psi_memory()

psutil.psi_io()

I'd request that both system-level and cgroup2-level data be provided for each category.
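As described in the kernel documentation linked above, each pressure file contains a `some` line and, for memory and io (and for cpu on newer kernels), a `full` line, e.g. `some avg10=0.00 avg60=0.00 avg300=0.00 total=0`. The avg fields are percentages of time over 10, 60 and 300 second windows, and `total` is the cumulative stall time in microseconds. System-wide values live in /proc/pressure/{cpu,memory,io}; on cgroup2, each group exposes the same format in cpu.pressure, memory.pressure and io.pressure. Below is a minimal sketch of what the proposed functions could do underneath, assuming the cgroup2 hierarchy is mounted at /sys/fs/cgroup. `PSILine`, `parse_psi` and `read_psi` are hypothetical names, not existing psutil APIs.

```python
import os
from collections import namedtuple

# Hypothetical result type: avg* are percentages, total is cumulative stall time in microseconds.
PSILine = namedtuple("PSILine", ["avg10", "avg60", "avg300", "total"])


def parse_psi(text):
    """Parse the contents of a PSI file into {"some": PSILine, "full": PSILine}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]  # "some" or "full"
        values = dict(item.split("=", 1) for item in fields[1:])
        result[kind] = PSILine(
            avg10=float(values["avg10"]),
            avg60=float(values["avg60"]),
            avg300=float(values["avg300"]),
            total=int(values["total"]),
        )
    return result


def read_psi(resource, cgroup=None, cgroup_root="/sys/fs/cgroup"):
    """Read PSI for "cpu", "memory" or "io", system-wide or for one cgroup2 group."""
    if cgroup is None:
        path = os.path.join("/proc/pressure", resource)
    else:
        # Assumes a cgroup2 (unified) hierarchy mounted at cgroup_root.
        path = os.path.join(cgroup_root, cgroup.lstrip("/"), resource + ".pressure")
    with open(path) as f:
        return parse_psi(f.read())
```

For example, `read_psi("memory")["some"].avg10` for the system as a whole, or `read_psi("io", cgroup="system.slice")` for one group. On kernels without PSI, or with it disabled, the `open()` call raises `OSError`.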

@giampaolo
Owner

giampaolo commented Apr 11, 2021

Mmm... This is the first time I've heard about this. I'm trying to understand how it works (https://unixism.net/2019/08/linux-pressure-stall-information-psi-by-example/). The information in those 3 files is easy to extract. What's more difficult is understanding how to interpret that data and imagining an actual use case. For instance, the psutil doc shows an actual use case for psutil.getloadavg(), translating those raw numbers into a percentage of CPU usage/load over time:

>>> import psutil
>>> psutil.getloadavg()
(3.14, 3.89, 4.67)
>>> psutil.cpu_count()
10
>>> # percentage representation over the last 1, 5, 15 mins
>>> [x / psutil.cpu_count() * 100 for x in psutil.getloadavg()]
[31.4, 38.9, 46.7]

If we were to add this, I would like to provide something similar in the doc: some actual code which does something useful with the raw numbers extracted from /proc/pressure. But in order to do that, I/we would first have to properly understand how this works. =)
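One possible doc-style use case, similar in spirit to the getloadavg() example: since `total` is the cumulative stall time in microseconds, sampling it twice and dividing the delta by the elapsed wall time gives the share of that interval during which at least one task (`some`) was stalled on the resource. A minimal sketch, assuming Linux with PSI enabled; `read_total`, `stall_percent` and `measure` are hypothetical helpers, not psutil API.

```python
import time


def read_total(resource, kind="some", root="/proc/pressure"):
    """Return the cumulative stall time in microseconds (the 'total' field)."""
    with open(root + "/" + resource) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == kind:
                for item in fields[1:]:
                    key, _, value = item.partition("=")
                    if key == "total":
                        return int(value)
    raise ValueError("no %r line in %s/%s" % (kind, root, resource))


def stall_percent(total_before_us, total_after_us, elapsed_s):
    """Share of elapsed wall time spent stalled, from two 'total' samples."""
    return (total_after_us - total_before_us) / (elapsed_s * 1_000_000) * 100


def measure(resource="cpu", kind="some", interval_s=1.0):
    """Sample 'total' twice, interval_s apart, and return the stall percentage."""
    t0, before = time.monotonic(), read_total(resource, kind)
    time.sleep(interval_s)
    t1, after = time.monotonic(), read_total(resource, kind)
    return stall_percent(before, after, t1 - t0)
```

On a PSI-enabled machine, `{r: round(measure(r), 1) for r in ("cpu", "memory", "io")}` would report the stall percentage for each resource over the last second. Unlike the load average, no division by `cpu_count()` is needed, because PSI already expresses stalls as time rather than as a count of runnable tasks.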

@MrPippin66
Author

PSI was developed by Facebook. They posted a decent explanation of how they use it, and the benefits it's given them.

https://lwn.net/Articles/759658/

FYI, that article also discusses the problems with "getloadavg" in detail, which go beyond the issues we've encountered (namely, that several threads can each be active for only a small part of their allocated time slice; this shows up as a high load average but low overall CPU utilization).
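To observe this on a given machine, one could print the load-average percentage from the psutil doc example next to the `some` averages from /proc/pressure/cpu. A high first figure alongside a low second one suggests the load average is overstating actual CPU contention. This is a rough sketch. It uses the standard library's `os.getloadavg()` and `os.cpu_count()` in place of psutil so it runs on its own, and the helper names are hypothetical. Note that the windows differ: the load average uses 1/5/15 minutes, while PSI uses 10/60/300 seconds.

```python
import os


def parse_some_avgs(text):
    """Return (avg10, avg60, avg300) from the 'some' line of PSI file contents."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            values = dict(item.split("=", 1) for item in fields[1:])
            return tuple(float(values[k]) for k in ("avg10", "avg60", "avg300"))
    raise ValueError("no 'some' line found")


def loadavg_percent():
    """Load average (1, 5, 15 min) as a percentage of logical CPUs, as in the psutil doc."""
    ncpu = os.cpu_count() or 1
    return tuple(x / ncpu * 100 for x in os.getloadavg())
```

Usage: read `/proc/pressure/cpu`, pass its contents to `parse_some_avgs()`, and print the result next to `loadavg_percent()`.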

Also, swap thrashing isn't the only memory condition that lowers processor throughput. This facility would capture the others too (high reclaim rates, high fault rates, etc.) without the need for complicated monitoring scripts.
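As a rough sketch of such a check, the function below classifies the contents of /proc/pressure/memory (or a cgroup2 memory.pressure file) using the avg10 values of the `some` line (at least one task stalled on memory) and the `full` line (all non-idle tasks stalled at the same time, i.e. lost throughput). The function names and the 10% / 5% thresholds are hypothetical placeholders for illustration, not recommended values.

```python
def parse_psi_avg10(text):
    """Map 'some'/'full' to their avg10 percentage from PSI file contents."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            values = dict(item.split("=", 1) for item in fields[1:])
            result[fields[0]] = float(values["avg10"])
    return result


def memory_pressure_status(text, some_warn=10.0, full_crit=5.0):
    """Classify memory pressure as 'ok', 'warning' or 'critical'.

    'full' above full_crit: all non-idle tasks were stalled at once.
    'some' above some_warn: at least one task was frequently waiting on memory.
    The default thresholds are arbitrary placeholders.
    """
    avg10 = parse_psi_avg10(text)
    if avg10.get("full", 0.0) > full_crit:
        return "critical"
    if avg10.get("some", 0.0) > some_warn:
        return "warning"
    return "ok"
```

For example, `memory_pressure_status(open("/proc/pressure/memory").read())` on a PSI-enabled system.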

I think having this available would simplify monitoring, and hopefully it would be adopted by monitoring products built on psutil, such as "ncpa".

@giampaolo giampaolo added new-api and removed api labels May 15, 2021
@MrPippin66
Author

@giampaolo Is this still a feature candidate?

@MrPippin66
Author

FYI, PSI data is reported by 'sar' on all current Linux distributions. I think being able to report this data in psutil merits attention.
