Consider using SIMD registers for "hot" local variables instead of placing them on stack when out of free GP registers #10444

voinokin · 2018-06-04T12:33:15Z

The idea is intuitive, though I'm not sure it was ever sounded in context of CLR JIT - why not use X/Y/ZMM registers for "hot" local variables to avoid stack memory accesses, just like common GP registers are used to load/store the values? (I'm not talking here about operations other than load/store MOVQ/MOVD because it's much deeper topic which may include auto-vectorization and other funny stuff.)

There are always up to at least 6 volatile SIMD registers, and the number of regs used may be increased up to the size of SIMD register file. With more complex techniques this may provide up to 8 regs for x86/SSE+, up to 16 regs for x64/SSE+, up to 32 regs for x64/AVX-512 (future). These numbers may be achievable in the context of CLR due to the fact that at the moment few code in system assemblies uses vectors, and to my understanding SIMD ops are now only used for FP operations otherwise.

Even taking into account the store forwarding mechanisms implemented in modern CPUs when accessing memory, the significant speed-up could be achieved. One extra point is that on HyperThreaded CPUs the register files are independent on each other, whereas memory access circuitry is mostly shared by (sub-)cores.

category:design
theme:register-allocator
skill-level:expert
cost:large
impact:large

The text was updated successfully, but these errors were encountered:

tannergooding · 2018-06-04T14:25:46Z

@voinokin, the new Hardware Intrinsics feature (and the existing System.Numerics.Vector) feature makes heavy use of SIMD instructions.

voinokin · 2018-06-04T14:36:02Z

@tannergooding, indeed it does (and my respect for participating in all this!)
Still, it is system or user code developer who chooses whether to use such intrinsics or not in particular place up to the app's need. So this does not contradict:

These numbers may be achievable in the context of CLR due to the fact that at the moment few code in system assemblies uses vectors, and to my understanding SIMD ops are now only used for FP operations otherwise.

What I mean logging this record here is - there are lots of places in system assemblies which DO NOT use any SIMD facilities at least for now, and in really many cases there WILL NEVER BE any relation to SIMD. In such cases, adding the ability to trade stack memory access ops for ops with (unused!) SIMD regs will improve performance.

tannergooding · 2018-06-04T14:42:58Z

However, this may also:

Break existing code that expects things to be on the stack
- The debugger (less of a concern if optimizations are enabled)
- User code that uses unsafe
- etc
Have its own overhead/latency (transferring between XMM and GPR is cheap, but not necessarily free, and could hurt pipelining in some cases)
etc

If something like this would be done, there would need to be an initial prototype clearly showing the gains this would provide and any drawbacks that would incur.

voinokin · 2018-06-04T14:50:24Z

Here is live example.
At the moment I'm developing the tool intended to sort large sets of data (up to 100 Gb).
Some steps of the algorithm showed perf improvement when I manually placed some of the local variables to SIMD registers available thru HW intrinstics which otherwise were automatically allocated on stack by JIT.
With just putting 2 local vars to SIMD regs the tool gained 10-20% improvement in throughoutput because the steps I mention are on the critical path (well, most part of the code is on a critical path when one's talking about sorting algorithms ;-) ).

mikedn · 2018-06-04T14:56:01Z

In theory it's a good idea. In practice it may be difficult to prove that this is a consistent improvement.

AFAIR someone suggested this years ago to the VC++ guys but I don't think they implemented it. If that's because they didn't have time or because there are problems associated with this idea I do not know.

One extra point is that on HyperThreaded CPUs the register files are independent on each other, whereas memory access circuitry is mostly shared by (sub-)cores.

Hmm, last time I checked recent CPUs (e.g. Skylake) had 2 load ports so memory loads technically have throughput 1/2. Instructions such as movd have throughput only 1.

Recently I played a bit with movd to "reinterpret" float as int and the results didn't seem too promising. Going through memory seemed faster, at least in some scenarios.

voinokin · 2018-06-04T14:57:31Z

If something like this would be done, there would need to be an initial prototype clearly showing the gains this would provide and any drawbacks that would incur.

True - the scope is large.

In theory it's a good idea. In practice it may be difficult to prove that this is a consistent improvement.

I'm willing to participate in this, but that's a matter of my spare time unfortunately.
Maybe in some time I will prepare perf test results measured current vs. suggested (will try to copy machine code to asm replacing stack access with SIMD regs access) for some common cases.

mikedn · 2018-06-04T14:57:40Z

At the moment I'm developing the tool intended to sort large sets of data (up to 100 Gb).

I wonder if what helps in that scenario is the fact that the variables are kept in registers or the fact that perhaps you're freeing a bit of CPU cache memory.

voinokin · 2018-06-04T14:58:48Z

At the moment I'm developing the tool intended to sort large sets of data (up to 100 Gb).
I wonder if what helps in that scenario is the fact that the variables are kept in registers or the fact that perhaps you're freeing a bit of CPU cache memory.

The variables' footprint is just 16 bytes, which is no more than 1 cache line.

voinokin · 2018-06-04T15:20:39Z

BTW, the issue #10394 I logged a while ago is related exactly to attempt to calculate and compare performance w.r.t. different microarchitectures. Could you please suggest some way how to do this? Is it required to make my personal modified build of JIT code and play with it, or there is better way to do?

RussKeldorph · 2018-06-04T17:35:38Z

@dotnet/jit-contrib

ArtBlnd · 2018-06-05T02:51:41Z

I don't think that does really improves for general scenarios.
hot memory access means that variable will be used more than general, which will be cached on CPU cache line. (storing on stack for hot variable? seems JIT did wrong register selection)

also, load and storing on SIMD registers will cause stall unless you are going to use SIMD instructions for hot memory area.

If there is any cases that makes hot memory area to stack. needs take a look closer to it.

voinokin · 2018-06-05T11:04:36Z

@ArtBlnd
I'm aware of reg read stalls, still the cache line access may be slower and occupies load or store port. IIRC the numbers which are added are: 3 cycles for reg read stall vs 4 cycles when reading from L1D cache. I will confirm these numbers later providing the source, and will also confirm the instruction encoding lengths for both cases.

Meanwhile, according to the tables from both Agner Fog and Intel themselves, the numbers for unpenalized READ on Intel CPUs Nehalem thru Skylake are:
MOV r, [m] - L=2c, T=0.5c, 1uop, runs on p2/3 (these load ports better be busy with something more useful :-) )
MOVD r32/r64, x/ymm - L=2c, T=1c, 1uop, runs on p0 (L=1c before Skylake, overall better before Haswell)
MOVQ r64, x/ymm - same as above

Unpenalized WRITE:
MOV [m], r - L=2c, T=1c, 2uop, runs on p2/3/[7] + p4 (L=3c, T=1c, 1uop before Skylake)
MOVD x/ymm, r32/r64 - L=2c, T=1c, 1uop, runs on p5 (L=1c before Skylake, overall better before Haswell)
MOVQ x/ymm, r64 - same as above

Also, with the support of the feature suggested, it will become possible to directly transfer data between local vars - it's quite common case to see sequences of MOVs between two stack locations in the code CLR JIT produces from IL. When vars are in SIMD regs it's just enough to do MOVAPS / MOVQ / MOVDQA, these instructions already have very good numbers - L=0-1c, T=0.25-0.33c, 1uop, run on p0/1/5, basically the same as common MOV r, r. This is ways faster than loading value from stack to GP reg, then storing it back.
Also, there is some benefit for the cases when it comes to the conversion between int <-> float data performed now with CVTxx2yy - the source value is already in SIMD reg, and it's not uncommon that the resulting value will be kept in other local var for some time.

Maybe some time later I will add numbers for AMDs (have no deep experience with these).

A side note - I doubt that storing more than one local var in SIMD reg is good idea due to the timing of PINSRx/PEXTRx instructions - L=3c, T=2c, 2uops.

voinokin · 2018-06-05T11:19:24Z

Here are numbers for L1D cache access taken from https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

voinokin · 2018-06-05T11:46:13Z

Regarding reg read stalls
The same manual tells in section "3.5.2.1 ROB Read Port Stalls":

As a micro-op is renamed, it determines whether its source operands have executed and been written to the reorder buffer (ROB), or whether they will be captured “in flight” in the RS or in the bypass network. Typically, the great majority of source operands are found to be “in flight” during renaming. Those that have been written back to the ROB are read through a set of read ports. Since the Intel Core microarchitecture is optimized for the common case where the operands are “in flight”, it does not provide a full set of read ports to enable all renamed micro-ops to read all sources from the ROB in the same cycle. When not all sources can be read, a micro-op can stall in the rename stage until it can get access to enough ROB read ports to complete renaming the micro-op. This stall is usually short-lived. Typically, a micro-op will complete renaming in the next cycle, but it appears to the application as a loss of rename bandwidth. [...skipped, then:...] Starting with Intel microarchitecture code name Sandy Bridge, ROB port stall no longer applies because data is read from the physical register file.

From Agner Fog's manual http://www.agner.org/optimize/microarchitecture.pdf section "9 Sandy Bridge and Ivy Bridge pipeline":

9.9 Register read stalls
Register read stalls has been a serious, and often neglected, bottleneck in previous
processors since the Pentium Pro. All Intel processors based on the P6 microarchitecture
and its successors, the Pentium M, Core and Nehalem microarchitectures have a limitation
of two or three reads from the permanent register file per clock cycle.
This bottleneck has now finally been removed in the Sandy Bridge and Ivy Bridge. In my
experiments, I have found no practical limit to the number of register reads.

ArtBlnd · 2018-06-06T11:13:31Z

@voinokin
Okay, than this makes sense.

Anyways, can I have some cases?
It will be great if you attach assemblies or source code(whatever its C or C++, C#) to help this out.

voinokin · 2018-06-06T18:47:34Z

@ArtBlnd
I'll come back later with some examples from real life to be close to "general cases", hopefully with perf measured.

voinokin · 2018-07-26T10:33:14Z

Couple thoughts and observations I've got on this topic while working on my hi-perf multithreaded app.

When variable roundtrips GP reg<->stack loc. are replaced with GP reg<->SIMD reg roundtrips, overall impression is that although the code I updated this way works with more or less same speed (maybe slightly slower on Nehalem which still has reg read stalls bottleneck mentioned couple posts earlier), OTHER methods started working slightly faster according to VTune. My first guess is that the pressure on load/store CPU ports and overall RAM bus pressure coming from such methods lessens giving more breath to other methods accessing memory. I'll try to confirm this and provide numbers some time later. If this idea is right, then replacing stack loc. accesses with SIMD regs accesses may happen to be even more beneficial on HyperThreaded CPUs since they have execution units located on the same physical core shared among logical cores (this is how I understand it), which of course include load/store units. Not sure about inter-GP/SIMD reg transfer units, but I suppose they are independent among logical cores since the register files are independent.
(Related) When implementing C++ style iterators with C# having their data stored as VTs on stack and code fully inlineable, it appears that there is significant amount of stack loc. <-> GP reg roundtrips. Such iterators are often used in tight loops, so such roundtrips are no good thing. I'll try to model replacement with SIMD reg <-> GP reg roundtrips to confirm benefits. Also, I'll investigate this closer for typical C# iteration loops over array elements including ubiquitous for loops.

P.S. This is again related to Intel CPUs, can't tell anything regarding AMDs for now since I have no experience with their uarch yet, and this is not primary priority to me for now.

tannergooding · 2019-04-03T20:08:09Z

It might be worth noting that the current Intel® 64 and IA-32 Architectures Optimization Reference Manual (April 2019) actually suggests spilling general-purpose registers to XMM registers (I also see this suggestion in the April 2018 edition, but didn't dive further back).

briansull · 2019-04-04T01:01:35Z

We can't spill general purpose registers that hold GC-refs of Byrefs into the XMM registers,
since we don't currently support reporting such registers in the GC info. (and it would a a lot of work add such support)

EgorBo · 2023-06-08T17:07:02Z

Related article: https://shipilev.net/jvm/anatomy-quarks/20-fpu-spills/

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

SingleAccretion mentioned this issue Dec 9, 2022

Consider spilling GPR in vector registers instead of memory #79464

Closed

kunalspathak self-assigned this Dec 9, 2022

kunalspathak mentioned this issue Oct 6, 2023

Arm64: Add SVE/SVE2 support in .NET 9 #93095

Closed

31 tasks

kunalspathak mentioned this issue Nov 7, 2023

Improving Arm64 Performance in .NET 9.0 #94464

Closed

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider using SIMD registers for "hot" local variables instead of placing them on stack when out of free GP registers #10444

Consider using SIMD registers for "hot" local variables instead of placing them on stack when out of free GP registers #10444

voinokin commented Jun 4, 2018 •

edited by BruceForstall

Loading

tannergooding commented Jun 4, 2018

voinokin commented Jun 4, 2018 •

edited

Loading

tannergooding commented Jun 4, 2018

voinokin commented Jun 4, 2018 •

edited

Loading

mikedn commented Jun 4, 2018

voinokin commented Jun 4, 2018

mikedn commented Jun 4, 2018

voinokin commented Jun 4, 2018

voinokin commented Jun 4, 2018 •

edited

Loading

RussKeldorph commented Jun 4, 2018

ArtBlnd commented Jun 5, 2018 •

edited

Loading

voinokin commented Jun 5, 2018 •

edited

Loading

voinokin commented Jun 5, 2018 •

edited

Loading

voinokin commented Jun 5, 2018 •

edited

Loading

ArtBlnd commented Jun 6, 2018

voinokin commented Jun 6, 2018

voinokin commented Jul 26, 2018 •

edited

Loading

tannergooding commented Apr 3, 2019

briansull commented Apr 4, 2019

EgorBo commented Jun 8, 2023

Consider using SIMD registers for "hot" local variables instead of placing them on stack when out of free GP registers #10444

Consider using SIMD registers for "hot" local variables instead of placing them on stack when out of free GP registers #10444

Comments

voinokin commented Jun 4, 2018 • edited by BruceForstall Loading

tannergooding commented Jun 4, 2018

voinokin commented Jun 4, 2018 • edited Loading

tannergooding commented Jun 4, 2018

voinokin commented Jun 4, 2018 • edited Loading

mikedn commented Jun 4, 2018

voinokin commented Jun 4, 2018

mikedn commented Jun 4, 2018

voinokin commented Jun 4, 2018

voinokin commented Jun 4, 2018 • edited Loading

RussKeldorph commented Jun 4, 2018

ArtBlnd commented Jun 5, 2018 • edited Loading

voinokin commented Jun 5, 2018 • edited Loading

voinokin commented Jun 5, 2018 • edited Loading

voinokin commented Jun 5, 2018 • edited Loading

ArtBlnd commented Jun 6, 2018

voinokin commented Jun 6, 2018

voinokin commented Jul 26, 2018 • edited Loading

tannergooding commented Apr 3, 2019

briansull commented Apr 4, 2019

EgorBo commented Jun 8, 2023

voinokin commented Jun 4, 2018 •

edited by BruceForstall

Loading

voinokin commented Jun 4, 2018 •

edited

Loading

voinokin commented Jun 4, 2018 •

edited

Loading

voinokin commented Jun 4, 2018 •

edited

Loading

ArtBlnd commented Jun 5, 2018 •

edited

Loading

voinokin commented Jun 5, 2018 •

edited

Loading

voinokin commented Jun 5, 2018 •

edited

Loading

voinokin commented Jun 5, 2018 •

edited

Loading

voinokin commented Jul 26, 2018 •

edited

Loading