psutil.cpu_count: Add argument(s) to allow differentiating "performance cores" from "efficiency cores" #2034
Comments
I saw your comment in email, but I don't see it on the issue; not sure if you deleted it, or if GitHub is not updating its website correctly.

Regarding the comment you shared vs. my proposal: the problem with your original comment is that […]: 8 "performance cores" with SMT for a total of 16 logical "performance cores", plus 8 "efficiency cores" (physical and logical) without SMT. But it is reported as a 16-core / 24-logical-core part, which is usually not helpful for creating worker threads and processes. Would definitely want a […]

More references: […]
Mmm... I'm not sure I fully understand how this would work in practice. If you're interested in "performance" vs. "efficiency" I guess you're supposed to know which CPUs (IDs) are "performant" vs. "efficient", and then tell the OS to assign a certain process to run on those CPUs. E.g., in hypothetical code:

```python
>>> psutil.performant_cpu_ids()
[0, 2, 4]
>>> psutil.Process().cpu_affinity([0, 2, 4])  # set
```

If instead you only know the total number of those CPUs, what can you do with that info alone?
There's a topic about changing […]:

```python
>>> psutil.cpu_count("performance")
4
>>> psutil.cpu_count("efficiency")
4
>>>
```
Sorry, I deleted it because I hit "submit" too soon.
Given what security engineers have done in recent years to all operating systems (especially Linux), setting thread affinity is a questionable proposition. However, with these new "hybrid" CPUs, it's pretty much a given that your CPU-intensive workload will get migrated to the performance cores (that's the entire point of a hybrid CPU). If a workload isn't CPU intensive, then it doesn't really matter if it gets migrated or not. I'm not sure if the special kernel scheduling hacks around these CPUs will even respect thread affinity or not. But aligning the number of worker threads with the number of physical "performance" cores is still of critical importance.

For the implementation, unless there are good OS-specific APIs around this, probably the best thing to do is maintain a list of CPU core families or code names and the associated core metadata, such as "performance" vs. "efficiency". This list would be relatively small, given that Intel, AMD, ARM, Apple and IBM/Power only release new core families once a year at most. The maintenance burden should be relatively low.
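That lookup-table idea could be sketched roughly like this — note the family names and their classifications below are illustrative examples chosen by me, not an authoritative or complete list:

```python
# Sketch of a core-family lookup table; entries are illustrative, not exhaustive.
CORE_TYPE_BY_FAMILY = {
    # ARM big.LITTLE / DynamIQ
    "cortex-a76": "performance",
    "cortex-a55": "efficiency",
    # Apple M1
    "firestorm": "performance",
    "icestorm": "efficiency",
    # Intel Alder Lake
    "golden cove": "performance",
    "gracemont": "efficiency",
}

def classify_core(family_name: str) -> str:
    """Return 'performance', 'efficiency', or 'unknown' for a core family name."""
    return CORE_TYPE_BY_FAMILY.get(family_name.strip().lower(), "unknown")
```

New entries would only be needed when a vendor ships a new core family, which keeps the maintenance burden in line with the once-a-year cadence mentioned above.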
How do you tell the OS to use those cores though? To my knowledge that's […]
It is safe to assume that if your worker threads are CPU-heavy, the OS kernel will automatically migrate the threads to the performance cores if they were on the efficiency cores; or, if CPU usage goes down later, the threads could be migrated back to efficiency cores. The CPU vendors contribute special scheduling code for hybrid CPUs to the various OS kernels; if these hybrid CPUs were treated like normal CPUs, they would have wildly inconsistent performance. If thread affinity has been set, I'm not sure what would happen (user-set affinity could be ignored, or maybe not); the behavior would be implementation-specific per OS.
FYI, I just implemented a feature similar to this in Java (thus cross-platform). Some helpful notes if this gets implemented:
Hello Daniel, thanks for providing such details.

Do you provide that on a per-cpu basis? In that case, it seems to me this belongs more to a […]:

```python
>>> psutil.cpu_info()
{'arch': 'x86_64',
 'byteorder': 'little',
 'flags': 'fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat '
          'pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx '
          'pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good '
          'nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 '
          'monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid '
          'sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx '
          'f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb '
          'invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi '
          'flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep '
          'bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt '
          'xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp '
          'hwp_notify hwp_act_window hwp_epp md_clear flush_l1d',
 'l1d_cache': 32768,
 'l1i_cache': 32768,
 'l2_cache': 262144,
 'l3_cache': 6291456,
 'model': 'Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz',
 'vendor': 'GenuineIntel'}
```

Perhaps we can add a […]
I wonder how this relates to […]:

```python
>>> psutil.cpu_mode()
'performance'
>>> psutil.cpu_mode(percpu=True)
['performance', 'performance', 'performance', 'performance']
```

...which could also be used for setting (actually this would be extremely cool):

```python
>>> psutil.cpu_mode("powersave")
>>> psutil.cpu_mode(['performance', 'performance', 'powersave', 'powersave'])
```

Do you know if […]
Depends on what you mean by "cpu" here. :) For my purposes, I created a new […]

Assuming HT is on, I have two separate enumerations: a […]

In the Windows enumeration you would have (among other output) 16 PROCESSOR_RELATIONSHIP structures with […]. For Linux […].

For everything else, you're stuck enumerating textual output matching "CPU X" (presumably also logical processors) with a textual description of the processor, e.g., this dmesg. For now, all the ARM big.LITTLE chips are a known set of Cortex-A7x (P-) and Cortex-A5x (E-) names; and for Apple M1 we know they're all Firestorm and Icestorm. For now. I haven't yet seen dmesg output from an Alder Lake chip; would be nice if I had an unlimited budget to buy one just to run the command. :-)
I wouldn't use "powersave" here; industry branding appears to align with "performance" and "efficiency" (or P-core and E-core). Also, while current hybrid chips only have two types, in theory future chips could have some mid-sized cores as well. Windows' choice of a "relative efficiency measure" aligns with this potential.
I don't think they are the same at all. You can (and often do) change frequency on one or more processors of the same type for performance considerations. The "capacity" represents a maximum performance level without scaling, and I don't think it's user-adjustable.
No, it's just an output identifying the type of chip, basically read-only.
Just took some time to catch up on the entire thread (which I skimmed earlier) and wanted to highlight a few points. The original request was just for the "number of performance cores", for the purpose of limiting workloads.
In your Windows implementation you are already iterating over an array of […]
Regarding #1392 (comment), and related comments about topology, you need a bit of both. In my project: […]
On most OSes, the combination of package, core, and chip id is necessary for full topology, with NUMA nodes + logical processors as a separate thing. On Windows there's a numbered logical topology with meaning in the OS (NUMA + logical) and a physical topology (package + core) without any numbering.

You could, in theory, keep the efficiency value at the lowest level (that's how you're going to collect it on every OS except Windows), so you could have "performance logical processors" and "efficiency logical processors" (although I'd bet one non-HT efficiency LP could process one task faster than two HT efficiency LPs could process two similar tasks).

So given the proposed API: […]

Then: […]
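The "keep the efficiency value at the lowest level" idea can be sketched with a flat list of logical processors. The `LogicalCPU` record, its field names, and the sample data below are all invented here for illustration; `efficiency_class` follows the Windows convention where a higher value means a more performant core:

```python
from collections import namedtuple

# One record per logical processor; all names here are illustrative.
LogicalCPU = namedtuple("LogicalCPU", "id package core efficiency_class")

def count_by_class(cpus, logical=True):
    """Count logical processors per efficiency class; with logical=False,
    count distinct physical (package, core) pairs instead."""
    counts, seen = {}, set()
    for cpu in cpus:
        key = cpu.id if logical else (cpu.package, cpu.core)
        if key in seen:
            continue
        seen.add(key)
        counts[cpu.efficiency_class] = counts.get(cpu.efficiency_class, 0) + 1
    return counts

# An Alder-Lake-like layout: 8 P-cores with SMT (two LPs each, class 1)
# plus 8 E-cores without SMT (one LP each, class 0).
sample = [LogicalCPU(i, 0, i // 2, 1) for i in range(16)] + \
         [LogicalCPU(16 + i, 0, 8 + i, 0) for i in range(8)]
```

With that in hand, "performance logical processors" vs. "efficiency logical processors" (and the physical counts) are just dictionary lookups.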
FYI, the GCC Compile Farm just made an M1 (4 performance + 4 efficiency cores) Linux machine available, so I was able to test out my own API. Here's the output of my processor-information implementation; you can see package 0 cores 0,1,2,3 are "efficiency" (lower #, 459) and package 1 cores 0,1,2,3 are "performance" (higher #, 1024): […]
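Those 459 vs. 1024 figures correspond to the per-CPU capacity values that Linux exposes on ARM systems with asymmetric cores as `/sys/devices/system/cpu/cpuN/cpu_capacity` (the attribute is absent on x86). A sketch of deriving the performance/efficiency split from them — the helper just treats the highest-capacity CPUs as "performance":

```python
import glob
import os

def read_capacities():
    """Return {cpu_id: capacity} from sysfs; empty where unsupported (e.g. x86)."""
    caps = {}
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpu_capacity"):
        cpu_id = int(os.path.basename(os.path.dirname(path))[3:])  # "cpu7" -> 7
        with open(path) as f:
            caps[cpu_id] = int(f.read())
    return caps

def split_by_capacity(caps):
    """Split CPU ids into (performance_ids, efficiency_ids) by capacity:
    the highest capacity present counts as "performance", the rest as
    "efficiency"."""
    if not caps:
        return [], []
    top = max(caps.values())
    perf = sorted(c for c, v in caps.items() if v == top)
    eff = sorted(c for c, v in caps.items() if v < top)
    return perf, eff
```

On a chip with mid-sized cores this two-way split would need to become a grouping by distinct capacity value instead.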
Bumping this as I too am interested in having this functionality added. Pardon my typing as I am writing this out on mobile.

As for the matter of what arguments should be used for the […]
As for the topic of operating systems having updated schedulers to handle heterogeneous CPU core architectures, there can definitely be times where a user would want to keep their processes on just the performance cores, just the efficiency cores, or some custom combination (like one performance core and two efficiency cores). I myself have an app that benchmarks different implementations of the same code for comparison purposes, where each implementation runs on its own logical core. Previously it just used all cores for the benchmark, but with performance/efficiency cores that skews the results toward whichever implementation happens to land on a performance core.
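For the per-implementation benchmarking described above, each run can pin itself to one chosen logical CPU so core type no longer skews the comparison. A sketch using the Linux-only stdlib call `os.sched_setaffinity`; psutil's `Process.cpu_affinity()` would be the cross-platform (Linux/Windows/FreeBSD, not macOS) route:

```python
import os

def run_pinned(benchmark, cpu_id):
    """Run `benchmark` pinned to a single logical CPU, then restore the
    process's original CPU mask (Linux-only stdlib API)."""
    original = os.sched_getaffinity(0)  # 0 = the current process
    try:
        os.sched_setaffinity(0, {cpu_id})
        return benchmark()
    finally:
        os.sched_setaffinity(0, original)
```

Combined with a performance/efficiency classification, each implementation could then be timed once on a P-core and once on an E-core for an apples-to-apples comparison.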
To be more precise in terminology, "affinity" generally relates to processes being assigned to particular CPUs. I think "mask" or "bitmask" or "cpumask" is a better term to use: it is generally the argument when setting affinity.
Good point, that is indeed what I meant and should've specified it better. I think the hardest part of this whole thing is actually finding out whether a core is "performance" or "efficiency" in a cross-platform way. I'm not aware of any portion of Windows or Linux that exposes that information to the user, although I'm sure there are reliable ways to do this that I'm not aware of. I was previously using pywin32 to get CPU information, but there weren't any specific attributes that give usable information on whether or not a given core is flagged as a performance or efficiency one.
I described several ways in this comment and have implemented them (cross platform) in Java, links in other comments above. |
OP here (new GitHub account). Regarding the topic of thread affinity on heterogeneous CPUs, I recently learned that macOS and Windows have APIs for setting […]. I am not aware that Linux has a comparable API yet. But this seems to be the future of managing thread affinity, so enabling users to set thread affinity manually should probably not be a design goal for this feature.
Agreed. However, reporting processor numbers and their correspondence to core types should be. I'll leave it to others to define the API, but in my (Java-based) project I have enough objects/lists to construct an output like this: […]
I think having the ability to view and change both the specific core affinities and the thread QoS would be beneficial, and the two would work better in conjunction with one another. With just the thread QoS setting, you would only be able to see what QoS level an application is using at any given time. A combined approach would give you all the information you need about the processor cores, and if someone is interested enough in setting core affinities manually down to the individual cores, they can look into implementing that on top of this.
Summary
Description
Many applications need to spawn a number of worker threads or processes, where the number of physical cores is the ideal number of workers to create. The existing `psutil.cpu_count(logical=False)` has served us well in that regard; however, changes in modern hardware are causing that API to become inadequate. For example, ARM big.LITTLE (including Apple M1), Intel Alder Lake (and, rumor has it, future AMD CPUs) feature a mix of "performance cores" and "efficiency cores", some of which may have SMT and others not, even within the same CPU.
AFAIK, the use case for `cpu_count` is almost always going to be the performance core count, except in the odd corner case where the performance and efficiency cores can be used at the same time and the performance delta does not matter because the individual jobs are small and there are many of them.

My proposal is to add something like this: […]
The use of an enum should keep this feature future-proof as CPU core types become more exotic and diverse than they are today.
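To make the enum idea concrete, here is a purely hypothetical sketch: the `CoreType` name, the `cpu_count` signature, and the mock counts (modeled on the 8 P-core + 8 E-core Alder Lake example above) are all invented here for illustration and are not the original proposal:

```python
import enum

class CoreType(enum.Enum):
    # Hypothetical identifiers, invented for illustration.
    PERFORMANCE = "performance"
    EFFICIENCY = "efficiency"

# Mock counts standing in for real detection: 8 P-cores with SMT
# (16 logical) and 8 E-cores without SMT.
_MOCK_COUNTS = {
    CoreType.PERFORMANCE: {"logical": 16, "physical": 8},
    CoreType.EFFICIENCY: {"logical": 8, "physical": 8},
}

def cpu_count(logical=True, core_type=None):
    """Hypothetical cpu_count() extension: with core_type=None it behaves
    like today (all cores); with a CoreType (or its string value) it counts
    only cores of that type."""
    kind = "logical" if logical else "physical"
    if core_type is None:
        return sum(v[kind] for v in _MOCK_COUNTS.values())
    return _MOCK_COUNTS[CoreType(core_type)][kind]
```

With this shape, `cpu_count(logical=False)` would still report 16 physical cores, while `cpu_count(logical=False, core_type=CoreType.PERFORMANCE)` reports the 8 P-cores that the worker-pool use case actually wants.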
Things to consider: […]