Improve handling of reused constants in the register allocator #70182

tannergooding · 2022-06-02T22:40:39Z

Today https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/lsra-heuristic-tuning.md documents that it does FREE and then CONST_AVAILABLE. This ends up not being able to reuse an already enregistered constant if it wasn't "last use".

A simple example is the following where C81 and C82 both represent the CNS_DBL of 0.0:

68.#46  C81   Def    ORDER(A) mm0  |     |V13 a|     |V2  a|V3  a|V0  a|V1  a|     |     |V10 a|V4  a|V7  a|C81 a|     |     |V101a|     |
70.#47  C82   Def    ORDER(A) mm1  |     |V13 a|     |V2  a|V3  a|V0  a|V1  a|     |     |V10 a|V4  a|V7  a|C81 a|C82 a|     |V101a|     |

While these values aren't last use, the last use does happen before the constant is overwritten and in some cases we can end up with 4-5 registers all being initialized (as is the case for Matrix4x4.Decompose) to the same constant value (often zero so emitting xorps many times in a row).

It would be beneficial if the register allocator could improve this scenario and keep the constant in one register for such scenarios. This in particular appears to impact Zero and AllBitsSet since they are "cheap to compute" and so typically don't undergo CSE.

category:cq
theme:register-allocator

The text was updated successfully, but these errors were encountered:

ghost · 2022-06-02T22:40:45Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Today https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/lsra-heuristic-tuning.md documents that it does FREE and then CONST_AVAILABLE. This ends up not being able to reuse an already enregistered constant if it wasn't "last use".

A simple example is the following where C81 and C82 both represent the CNS_DBL of 0.0:

68.#46  C81   Def    ORDER(A) mm0  |     |V13 a|     |V2  a|V3  a|V0  a|V1  a|     |     |V10 a|V4  a|V7  a|C81 a|     |     |V101a|     |
70.#47  C82   Def    ORDER(A) mm1  |     |V13 a|     |V2  a|V3  a|V0  a|V1  a|     |     |V10 a|V4  a|V7  a|C81 a|C82 a|     |V101a|     |

While these values aren't last use, the last use does happen before the constant is overwritten and in some cases we can end up with 4-5 registers all being initialized to the same constant value (often zero so emitting xorps many times in a row).

It would be beneficial if the register allocator could improve this scenario and keep the constant in one register for such scenarios. This in particular appears to impact Zero and AllBitsSet since they are "cheap to compute" and so typically don't undergo CSE.

Author:	tannergooding
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

kunalspathak · 2022-06-02T22:51:23Z

There is an assumption in LSRA that materializing constants are cheap and hence we don't want to tie it to a register. May be if we have an information of the occurances of particular constants (zeros for sure), we can have it assigned a register. I tried experimenting by switching the order of CONST_AVAILABLE and FREE and that (not surprisingly) shows lot of regressions:

[15:20:52]
[15:20:52] Top method regressions (bytes):
[15:20:52]           54 (20.00% of base) : 36178.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateConstrainedBillboardBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           43 ( 2.55% of base) : 34909.dasm - System.Numerics.Matrix4x4:Decompose(System.Numerics.Matrix4x4,byref,byref,byref):bool
[15:20:52]           28 (19.18% of base) : 10260.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateLookAtBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           28 (19.18% of base) : 17977.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateWorldBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           26 (13.27% of base) : 35808.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateBillboardBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           22 ( 6.83% of base) : 17425.dasm - System.Numerics.Tests.Perf_Plane:CreateFromVerticesBenchmark():System.Numerics.Plane:this
[15:20:52]           18 ( 1.91% of base) : 3355.dasm - Benchmarks.SIMD.RayTracer.RayTracer:CreateDefaultScene():Benchmarks.SIMD.RayTracer.Scene
[15:20:52]           10 ( 4.44% of base) : 33917.dasm - Benchstone.MDBenchI.MDXposMatrix:Test():bool:this
[15:20:52]            5 ( 0.38% of base) : 15958.dasm - Benchstone.MDBenchF.MDRomber:Test():bool:this
[15:20:52]            5 ( 1.61% of base) : 16620.dasm - Benchstone.MDBenchF.MDSqMtx:Test():bool:this
[15:20:52]
[15:20:52] Top method regressions (percentages):
[15:20:52]           54 (20.00% of base) : 36178.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateConstrainedBillboardBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           28 (19.18% of base) : 10260.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateLookAtBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           28 (19.18% of base) : 17977.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateWorldBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           26 (13.27% of base) : 35808.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateBillboardBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           22 ( 6.83% of base) : 17425.dasm - System.Numerics.Tests.Perf_Plane:CreateFromVerticesBenchmark():System.Numerics.Plane:this
[15:20:52]           10 ( 4.44% of base) : 33917.dasm - Benchstone.MDBenchI.MDXposMatrix:Test():bool:this
[15:20:52]           43 ( 2.55% of base) : 34909.dasm - System.Numerics.Matrix4x4:Decompose(System.Numerics.Matrix4x4,byref,byref,byref):bool
[15:20:52]           18 ( 1.91% of base) : 3355.dasm - Benchmarks.SIMD.RayTracer.RayTracer:CreateDefaultScene():Benchmarks.SIMD.RayTracer.Scene
[15:20:52]            5 ( 1.61% of base) : 16620.dasm - Benchstone.MDBenchF.MDSqMtx:Test():bool:this
[15:20:52]            5 ( 0.38% of base) : 15958.dasm - Benchstone.MDBenchF.MDRomber:Test():bool:this
[15:20:52]
[15:20:52] 10 total methods with Code Size differences (0 improved, 10 regressed), 0 unchanged.

windows/x64 coreclr:

kunalspathak · 2022-06-16T18:28:33Z

This will most likely not happen in .NET 7

tannergooding · 2022-06-16T19:18:46Z

This will most likely not happen in .NET 7

This seems fine to me. There are a few places where we still have what appear to be associated regressions most of which appear to be cases of Zero or AllBitsSet being initialized more than necessary.

This sometimes impacts loop alignment, etc. So we'll probably want to take a closer look in 8. But overall the changes related to GT_CNS_VEC are positive improvements and the regressions tend to be nanoseconds in range so it shouldn't be a huge priority in 7

tannergooding · 2022-06-17T15:33:17Z

Just adding in a summary that I gave elsewhere...

There are basically two fighting forces at play. On one hand we have constant folding, containment, and other optimizations that can only kick in when we know we have a constant. On the other hand, if we don't know that one of these optimizations are going to kick in then hoisting these values outside the loop is likely strictly better. However, given that we can't really know for sure what will be emitted until lowering, we're making vague heuristical guesses in HIR instead.

The premise of my proposal is that we should always CSE and always hoist constants into locals. We should then specially track that a given local represents a constant and we should prevent such locals from being rewritten (either via a new node kind GT_LCL_CNS or a flag on GT_LCL_VAR/GT_LCL_FLD).

Constant folding and lowering would then need to be updated to consider both GT_CNS_* and this new type of "constant local". This allows constant folding to still occur, it allows lowering to still perform containment, etc. However, it also ensures that if a given architecture or code path won't specially handle the constant that it is still hoisted outside the loop and won't be a repeated per-iteration cost.

A downside to this are that it may have have some minor impact on throughput. LSRA will also need to determine if all users of a local mark it as contained so that it can avoid initializing a value and burning a register unnecessarily. Much like with other scenarios, LSRA could understand and treat these constants specially and know they are cheaper to rematerialize and so therefore can be preference them for spilling if register contention is an issue (which should result in no-worse codegen than what we have today).

TIHan · 2023-02-02T02:28:55Z

Related: #76067

kunalspathak · 2023-06-05T18:25:10Z

I don't have time to look into this issue in .NET 8. We need to have a "User Story" in future release that contains all these LSRA issues to get them prioritized.

kunalspathak · 2023-06-16T01:44:22Z

While working on #87082, been debugging the codegen on osx/arm64 to see why we don't reuse the constant for such scenarios:

           movz    x0, #0x8C00
           movk    x0, #817 LSL #16
           movk    x0, #1 LSL #32
           movz    x1, #0x8C00
           movk    x1, #817 LSL #16
           movk    x1, #1 LSL #32
           ldr     x1, [x1]
           blr     x1

Idle code gen:

            movz    x0, #0x8C00
            movk    x0, #817 LSL #16
            movk    x0, #1 LSL #32
            ldr     x1, [x0]
            blr     x1

Multiple reasons:

We seem to track the registers containing constants, but as soon as we see a different interval, we reset the tracked register.

runtime/src/coreclr/jit/lsra.cpp

Lines 6112 to 6119 in 35a5afb

    
           if (interval->isConstant) 
        
           { 
        
               setConstantReg(reg->regNum, interval->registerType); 
        
           } 
        
           else 
        
           { 
        
               clearConstantReg(reg->regNum, interval->registerType); 
        
           }

We try to match the constant represented by IConNode or VecCons but not anything further like FtnAddress (which is the case in above example).
If a register is holding a constant, we mark it as "busy". When we try to allocate for another position that contains exact same constant, we run try_FREE() heuristic first, which eliminates the register holding constant from list of possible candidates. When we do try_CONST_AVAILABLE(), we never see the register that is holding the same constant. Hence, we end up assigning a new register. And of course, there are bunch of asserts that kicks in when I was trying to make this work.

I am actually wondering, why can't we just have a lookup table of N entries, where N is the number of available registers. key would be const and value would be reg. We store the constant against the register that is holding it. A new heuristic will check if the constant we are allocating for, is present in lookup and if yes, just give it the register present in lookup instead of assigning it a new register. Of course, at block boundaries, we should flush this lookup table, so we don't hold the entries that are true in one pred and not in another.

kunalspathak · 2023-06-22T15:02:53Z

Another example from @jakobbotsch - #76928 (comment)

ghost added the untriaged New issue has not been triaged by the area owner label Jun 2, 2022

dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Jun 2, 2022

kunalspathak added this to the 7.0.0 milestone Jun 2, 2022

tannergooding mentioned this issue Jun 3, 2022

Fixing the costing of GT_CNS_DBL and GT_CNS_VEC instructions #70215

Merged

JulieLeeMSFT assigned kunalspathak Jun 3, 2022

kunalspathak modified the milestones: 7.0.0, 8.0.0 Jun 16, 2022

kunalspathak mentioned this issue Jun 17, 2022

Implement Vector128 version of System.Buffers.Text.Base64 DecodeFromUtf8 and EncodeToUtf8 #70654

Merged

tannergooding mentioned this issue Aug 17, 2022

Test failure JIT\\HardwareIntrinsics\\General\\Vector256\\Vector256_ro\\Vector256_ro.cmd #70642

Closed

kunalspathak mentioned this issue Aug 17, 2022

Disable matching constants for vectors that needs upper half to be save/restore #74110

Merged

kunalspathak mentioned this issue Oct 28, 2022

[RyuJIT][LSRA] Add support for rematerialization #6264

Closed

jakobbotsch mentioned this issue Dec 16, 2022

Flaky CSE around pdep mask #477

Open

TimothyMakkison mentioned this issue Dec 21, 2022

Modernisation/Performance boost POC Cysharp/Ulid#39

Merged

kunalspathak removed this from the 8.0.0 milestone Jun 5, 2023

kunalspathak added this to the Future milestone Jun 5, 2023

jakobbotsch mentioned this issue Jun 8, 2023

JIT: Generalized struct promotion #76928

Closed

40 tasks

tannergooding mentioned this issue Jun 9, 2023

Regressions in System.Numerics.Tests.Perf_Plane #70497

Closed

AndyAyersMS mentioned this issue Jan 27, 2024

(Performance) Simd vector loaded from memory instead of register multiple times #97046

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve handling of reused constants in the register allocator #70182

Improve handling of reused constants in the register allocator #70182

tannergooding commented Jun 2, 2022 •

edited by BruceForstall

ghost commented Jun 2, 2022

kunalspathak commented Jun 2, 2022

kunalspathak commented Jun 16, 2022

tannergooding commented Jun 16, 2022

tannergooding commented Jun 17, 2022

TIHan commented Feb 2, 2023

kunalspathak commented Jun 5, 2023

kunalspathak commented Jun 16, 2023 •

edited

kunalspathak commented Jun 22, 2023

Improve handling of reused constants in the register allocator #70182

Improve handling of reused constants in the register allocator #70182

Comments

tannergooding commented Jun 2, 2022 • edited by BruceForstall

ghost commented Jun 2, 2022

kunalspathak commented Jun 2, 2022

kunalspathak commented Jun 16, 2022

tannergooding commented Jun 16, 2022

tannergooding commented Jun 17, 2022

TIHan commented Feb 2, 2023

kunalspathak commented Jun 5, 2023

kunalspathak commented Jun 16, 2023 • edited

kunalspathak commented Jun 22, 2023

tannergooding commented Jun 2, 2022 •

edited by BruceForstall

kunalspathak commented Jun 16, 2023 •

edited