Consider using SIMD registers for "hot" local variables instead of placing them on stack when out of free GP registers #10444
@voinokin, the new Hardware Intrinsics feature (and the existing System.Numerics.Vector feature) makes heavy use of SIMD instructions.
@tannergooding, indeed it does (and my respect to you for participating in all this!)
What I mean by logging this issue here is that there are lots of places in the system assemblies which DO NOT use any SIMD facilities, at least for now, and in many cases there WILL NEVER BE any relation to SIMD. In such cases, adding the ability to trade stack memory accesses for operations on (otherwise unused!) SIMD registers should improve performance.
However, this may also:
If something like this were done, there would need to be an initial prototype clearly showing the gains it would provide and any drawbacks it would incur.
Here is a live example.
In theory it's a good idea. In practice it may be difficult to prove that this is a consistent improvement. AFAIR someone suggested this to the VC++ guys years ago, but I don't think they implemented it. Whether that's because they didn't have time or because there are problems associated with this idea, I do not know.
Hmm, last time I checked, recent CPUs (e.g. Skylake) had 2 load ports, so memory loads technically have a throughput of 1/2 cycle.
True - the scope is large.
I'm willing to participate in this, but that's a matter of my spare time, unfortunately.
I wonder if what helps in that scenario is the fact that the variables are kept in registers or the fact that perhaps you're freeing a bit of CPU cache memory.
The variables' footprint is just 16 bytes, which is no more than 1 cache line.
BTW, issue #10394, which I logged a while ago, relates exactly to attempting to measure and compare performance across different microarchitectures. Could you please suggest a way to do this? Do I need to make my own modified build of the JIT and experiment with it, or is there a better way?
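(Not from the thread, just a suggestion of mine: short of building a modified JIT, the raw cost of the two round trips can at least be compared with a micro-benchmark. The sketch below assumes BenchmarkDotNet and the SSE2 hardware intrinsics on an x64 machine; it measures a store/load through an explicit stack slot versus a MOVQ round trip through an XMM register, which is only a rough proxy for what a JIT-level change would do.)

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class SpillRoundTrip
{
    private long _value = 0x1234_5678_9ABC_DEF0;

    // Baseline: push the value through an explicit stack slot (store + load).
    [Benchmark(Baseline = true)]
    public long ThroughStack()
    {
        Span<long> slot = stackalloc long[1];
        slot[0] = _value;   // store to the stack
        return slot[0];     // load it back
    }

    // Candidate: park the value in an XMM register and read it back.
    // Assumes an x64 machine, where SSE2 is always available.
    [Benchmark]
    public long ThroughXmm()
    {
        Vector128<long> xmm = Sse2.X64.ConvertScalarToVector128Int64(_value); // movq xmm, r64
        return Sse2.X64.ConvertToInt64(xmm);                                  // movq r64, xmm
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<SpillRoundTrip>();
}
```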
@dotnet/jit-contrib
I don't think this really improves general scenarios. Also, loading from and storing to SIMD registers will cause stalls unless you are going to use SIMD instructions on those values afterwards.
@ArtBlnd Meanwhile, according to the tables from both Agner Fog and Intel themselves, the numbers for Intel CPUs from Nehalem through Skylake are as follows.
Unpenalized READ:
Unpenalized WRITE:
Also, with support for the suggested feature it becomes possible to transfer data directly between local vars - it's quite a common case to see sequences of such local-to-local copies. Maybe some time later I will add numbers for AMD CPUs (I have no deep experience with these). A side note - I doubt that storing more than one local var in a SIMD register is a good idea, due to the timing of the extra lane insert/extract operations that would require.
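For concreteness, here is a minimal C# sketch of my own (not from the issue) of the MOVQ round trip being discussed, written with the hardware intrinsics that lower to those instructions. A real implementation would of course live in the register allocator and emit this implicitly, not require user code.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class XmmLocalSketch
{
    // Hand-emulated version of what the proposed spill could look like for a
    // 64-bit local that the JIT could not keep in a GP register.
    public static long RoundTrip(long value)
    {
        if (!Sse2.X64.IsSupported)
            return value; // only meaningful on x64 with SSE2

        // "Store" the GP value into an XMM register: movq xmm, r64
        Vector128<long> heldInXmm = Sse2.X64.ConvertScalarToVector128Int64(value);

        // A local-to-local transfer between two XMM-held values is a plain
        // register-to-register move (movaps/movdqa) - no memory traffic.
        Vector128<long> copy = heldInXmm;

        // "Load" the value back into a GP register: movq r64, xmm
        return Sse2.X64.ConvertToInt64(copy);
    }
}
```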
Here are the numbers for L1D cache access, taken from https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Regarding reg read stalls
From Agner Fog's manual http://www.agner.org/optimize/microarchitecture.pdf section "9 Sandy Bridge and Ivy Bridge pipeline":
@voinokin Anyways, can I have some cases?
@ArtBlnd
A couple of thoughts and observations I've gathered on this topic while working on my high-performance multithreaded app.
P.S. This again relates to Intel CPUs; I can't say anything about AMD for now since I have no experience with their microarchitecture yet, and it is not a primary priority for me at the moment.
We can't spill general-purpose registers that hold GC refs or byrefs into the XMM registers, because the GC would not be able to track them there.
Related article: https://shipilev.net/jvm/anatomy-quarks/20-fpu-spills/
The idea is intuitive, though I'm not sure it was ever voiced in the context of the CLR JIT - why not use X/Y/ZMM registers for "hot" local variables to avoid stack memory accesses, just like common GP registers are used to load/store the values? (I'm not talking here about operations other than load/store via MOVQ/MOVD, because that's a much deeper topic which may include auto-vectorization and other funny stuff.)
There are always at least 6 volatile SIMD registers, and the number of registers used may be increased up to the size of the SIMD register file. With more complex techniques this may provide up to 8 registers for x86/SSE+, up to 16 for x64/SSE+, and up to 32 for x64/AVX-512 (in the future). These numbers may be achievable in the context of the CLR because, at the moment, little code in the system assemblies uses vectors, and to my understanding SIMD registers are otherwise only used for FP operations.
Even taking into account the store-forwarding mechanisms implemented in modern CPUs for memory accesses, a significant speed-up could be achieved. One extra point is that on Hyper-Threaded CPUs the register files are independent of each other, whereas the memory access circuitry is mostly shared between the (sub-)cores.
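To illustrate the kind of method the proposal targets, here is a hypothetical C# example of my own (not from the issue): a hot loop with more simultaneously-live integer locals than x64 has general-purpose registers, so today some of them end up in stack slots even while the XMM registers sit idle.

```csharp
// Sixteen long accumulators stay live across the loop body, which exceeds the
// roughly 14 usable GP registers on x64; the overflow locals are the "hot"
// candidates that could instead be kept in otherwise-unused XMM registers.
public static long ManyHotLocals(long[] data)
{
    long a0 = 1, a1 = 2, a2 = 3, a3 = 4, a4 = 5, a5 = 6, a6 = 7, a7 = 8;
    long a8 = 9, a9 = 10, a10 = 11, a11 = 12, a12 = 13, a13 = 14, a14 = 15, a15 = 16;

    for (int i = 0; i < data.Length; i++)
    {
        long v = data[i];
        a0 += v; a1 ^= v; a2 += v; a3 ^= v; a4 += v; a5 ^= v; a6 += v; a7 ^= v;
        a8 += v; a9 ^= v; a10 += v; a11 ^= v; a12 += v; a13 ^= v; a14 += v; a15 ^= v;
    }

    return a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7
         + a8 + a9 + a10 + a11 + a12 + a13 + a14 + a15;
}
```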
category:design
theme:register-allocator
skill-level:expert
cost:large
impact:large