Enregister structs on win x64. #55045

Merged
merged 3 commits into from
Jul 9, 2021

sandreenko
Contributor

@sandreenko sandreenko commented Jul 1, 2021

This PR enables struct enregistration for x64 windows (the only platform without multiregs).

Contributes to #43867.

diffs:
libraries_tests.pmi.windows.x64.checked.mch
Total bytes of delta: -10799
532 total files with Code Size differences (405 improved, 127 regressed)

libraries.pmi.windows.x64.checked.mch
Total bytes of delta: -9534
253 total files with Code Size differences (217 improved, 36 regressed)

libraries.crossgen2.windows.x64.checked.mch
Total bytes of delta: -9243
701 total files with Code Size differences (593 improved, 108 regressed)

coreclr_tests.pmi.windows.x64.checked.mch
Total bytes of delta: -18353
201 total files with Code Size differences (200 improved, 1 regressed)

benchmarks.run.windows.x64.checked.mch
Total bytes of delta: -1166
25 total files with Code Size differences (23 improved, 2 regressed)

The regressions are expected. They are caused by:

  1. a change in lcl frame size from N1 (N1 % 16 == 0) to N2 (N2 < N1 but N2 % 16 != 0), which causes more instructions in the prolog to do zero-init.
  2. push/pop of callee-saved registers that we did not use before, which costs us 2 instructions.
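
To make the first cause concrete, here is a rough sketch of why a smaller frame can need more zero-init instructions. It is a hypothetical cost model, not the JIT's actual prolog logic: it assumes the frame is zeroed with the widest stores available (16 bytes first, then 8, 4, 2, 1), and the name `zeroInitStoreCount` is made up for illustration.

```cpp
#include <cstddef>

// Hypothetical model (an assumption, not JIT code): count the stores needed
// to zero a frame when we greedily use 16-, 8-, 4-, 2- and 1-byte stores.
std::size_t zeroInitStoreCount(std::size_t frameBytes)
{
    const std::size_t widths[] = {16, 8, 4, 2, 1};
    std::size_t stores = 0;
    for (std::size_t width : widths)
    {
        stores += frameBytes / width; // as many stores of this width as fit
        frameBytes %= width;          // the remainder goes to narrower stores
    }
    return stores;
}
```

Under this model a 32-byte frame needs two 16-byte stores, while a 28-byte frame needs three stores (16 + 8 + 4), so shrinking the frame can still add an instruction.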

There are tiny changes on other platforms because of improved lowering (IsEnregisterableLcl returns false more often, which allows more contained cases). They are not worth detailing here (all improvements, no regressions).

Note that this is not the final state: the biggest beast is call(obj(addr(lcl_var))), and I will be dealing with it separately.

Benchmarks improvement (for improvements analysis)
Top method regressions (percentages):
           8 ( 1.55% of base) : 20377.dasm - Microsoft.CodeAnalysis.CSharp.Binder:BindDeclaratorArguments(Microsoft.CodeAnalysis.CSharp.Syntax.VariableDeclaratorSyntax,Microsoft.CodeAnalysis.DiagnosticBag):System.Collections.Immutable.ImmutableArray`1[[Microsoft.CodeAnalysis.CSharp.BoundExpression, Microsoft.CodeAnalysis.CSharp, Version=2.10.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]]:this
           3 ( 0.22% of base) : 10240.dasm - <WriteAsync>d__21:MoveNext():this

Top method improvements (percentages):
         -12 (-21.82% of base) : 15891.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateReflectionBenchmark():System.Numerics.Matrix4x4:this
         -30 (-18.29% of base) : 23340.dasm - System.Numerics.Tests.Perf_Quaternion:MultiplyByQuaternionBenchmark():System.Numerics.Quaternion:this
        -156 (-13.74% of base) : 18406.dasm - BinopEasyOut:TypeToIndex(Microsoft.CodeAnalysis.CSharp.Symbols.TypeSymbol):System.Nullable`1[Int32]
         -43 (-13.23% of base) : 20176.dasm - Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.LanguageParser:IsPossibleTypedIdentifierStart(Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.SyntaxToken,Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.SyntaxToken,bool):System.Nullable`1[Boolean]:this
         -12 (-10.34% of base) : 22931.dasm - System.Numerics.Tests.Perf_Quaternion:LerpBenchmark():System.Numerics.Quaternion:this
         -96 (-8.99% of base) : 8287.dasm - System.TimeZoneInfo:CreateAdjustmentRuleFromTimeZoneInformation(byref,System.DateTime,System.DateTime,int):AdjustmentRule
         -80 (-8.06% of base) : 9184.dasm - Jil.Common.Utils:_ReadFieldOperands(System.Reflection.Emit.OpCode,System.Byte[],int,int,byref,byref,byref,byref):System.Nullable`1[Int32]
         -44 (-3.26% of base) : 10850.dasm - Newtonsoft.Json.JsonTextReader:ReadAsBoolean():System.Nullable`1[Boolean]:this
         -12 (-2.09% of base) : 17034.dasm - Microsoft.CodeAnalysis.CSharp.CSharpCompilation:CreateModuleBuilder(Microsoft.CodeAnalysis.Emit.EmitOptions,Microsoft.CodeAnalysis.IMethodSymbol,System.IO.Stream,System.Collections.Generic.IEnumerable`1[[Microsoft.CodeAnalysis.EmbeddedText, Microsoft.CodeAnalysis, Version=2.10.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]],System.Collections.Generic.IEnumerable`1[[Microsoft.CodeAnalysis.ResourceDescription, Microsoft.CodeAnalysis, Version=2.10.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]],Microsoft.CodeAnalysis.CodeGen.CompilationTestData,Microsoft.CodeAnalysis.DiagnosticBag,System.Threading.CancellationToken):Microsoft.CodeAnalysis.Emit.CommonPEModuleBuilder:this
          -4 (-1.58% of base) : 19450.dasm - System.Reflection.Metadata.Ecma335.MetadataBuilder:AddMethodDefinition(int,int,System.Reflection.Metadata.StringHandle,System.Reflection.Metadata.BlobHandle,int,System.Reflection.Metadata.ParameterHandle):System.Reflection.Metadata.MethodDefinitionHandle:this
        -156 (-1.06% of base) : 15598.dasm - DynamicClass:_DynamicMethod3(byref,int):MicroBenchmarks.Serializers.MyEventsListerItem
          -4 (-1.01% of base) : 19377.dasm - System.Reflection.Metadata.Ecma335.MetadataBuilder:AddAssemblyReference(System.Reflection.Metadata.StringHandle,System.Version,System.Reflection.Metadata.StringHandle,System.Reflection.Metadata.BlobHandle,int,System.Reflection.Metadata.BlobHandle):System.Reflection.Metadata.AssemblyReferenceHandle:this
        -130 (-0.89% of base) : 16200.dasm - DynamicClass:_DynamicMethod3(System.IO.TextReader,int):MicroBenchmarks.Serializers.MyEventsListerItem
         -46 (-0.67% of base) : 21098.dasm - DynamicClass:_DynamicMethod1(byref,int):MicroBenchmarks.Serializers.CollectionsOfPrimitives
         -21 (-0.66% of base) : 4229.dasm - HillClimbing:Update(int,double,int):System.ValueTuple`2[Int32,Int32]:this
          -8 (-0.64% of base) : 11030.dasm - Benchmarks.SIMD.RayTracer.RayTracer:CreateDefaultScene():Benchmarks.SIMD.RayTracer.Scene
         -44 (-0.60% of base) : 26146.dasm - DynamicClass:_DynamicMethod1(System.IO.TextReader,int):MicroBenchmarks.Serializers.CollectionsOfPrimitives
        -130 (-0.55% of base) : 24235.dasm - DynamicClass:_DynamicMethod1(System.IO.TextReader,int):MicroBenchmarks.Serializers.IndexViewModel
        -130 (-0.54% of base) : 13985.dasm - DynamicClass:_DynamicMethod1(byref,int):MicroBenchmarks.Serializers.IndexViewModel
          -2 (-0.19% of base) : 15808.dasm - <SyncReadAsync>d__5:MoveNext():this
Improvement example
        struct S
        {
            public byte b0;
            public byte b1;
            public byte b2;
            public byte b3;
            public byte b4;
            public byte b5;
            public byte b6;
            public byte b7;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static S TestInit(int a)
        {
            S s = new S();
          
            if (a == 0)
            {
                s = GetS1();
            }
            else if (a == 1)
            {
                s = GetS2();
            }
            else
            {
                S s2 = GetS1();
                s = s2;               
            }
            return s;
        }

For a test like this, where we init/copy a struct that fits into a register, do not access its fields (7.0 maybe), and do not pass it as a call argument (next step, hopefully, 6.0), we get the following diffs:

; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  3.50)     int  ->  rcx
-;  V01 loc0         [V01,T01] (  4,  2.50)  struct ( 8) [rsp+20H]   do-not-enreg[SB] ld-addr-op
+;  V01 loc0         [V01,T01] (  4,  2.50)  struct ( 8) rax         ld-addr-op
;  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"

-; Total bytes of code 57
+; Total bytes of code 38

G_M65053_IG01:        ; func=00, offs=000000H, size=0004H, bbWeight=1    PerfScore 0.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN000e: 000000 sub      rsp, 40

G_M65053_IG02:        ; offs=000004H, size=0004H, bbWeight=1    PerfScore 1.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz

IN0001: 000004 test     ecx, ecx
IN0002: 000006 jne      SHORT G_M65053_IG04

G_M65053_IG03:        ; offs=000008H, size=000CH, bbWeight=0.50 PerfScore 2.00, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz

IN0003: 000008 call     TestStructFields.Program:GetS1():S
-IN0004: 00000D mov      qword ptr [V01 rsp+20H], rax
IN0005: 000012 jmp      SHORT G_M65053_IG06

G_M65053_IG04:        ; offs=000014H, size=0011H, bbWeight=0.50 PerfScore 2.63, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz

IN0006: 000014 cmp      ecx, 1
IN0007: 000017 jne      SHORT G_M65053_IG05
IN0008: 000019 call     TestStructFields.Program:GetS2():S
-IN0009: 00001E mov      qword ptr [V01 rsp+20H], rax
IN000a: 000023 jmp      SHORT G_M65053_IG06

G_M65053_IG05:        ; offs=000025H, size=000AH, bbWeight=0.50 PerfScore 1.00, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN000b: 000025 call     TestStructFields.Program:GetS1():S
-IN000c: 00002A mov      qword ptr [V01 rsp+20H], rax

G_M65053_IG06:        ; offs=00002FH, size=0005H, bbWeight=1    PerfScore 1.00, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

-IN000d: 00002F mov      rax, qword ptr [V01 rsp+20H]
+IN000a: 000020 nop

G_M65053_IG07:        ; offs=000034H, size=0005H, bbWeight=1    PerfScore 1.25, epilog, nogc, extend

IN000f: 000034 add      rsp, 40
IN0010: 000038 ret
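
The diff above keeps V01 in rax because the whole struct is 8 bytes, so copying it is the same as copying one 64-bit value. A minimal C++ sketch of that idea follows. The helper names `toRegister` and `fromRegister` are hypothetical and only illustrate the size argument; they are not part of the JIT.

```cpp
#include <cstdint>
#include <cstring>

// Same shape as the C# struct S in the example: eight byte-sized fields.
struct S
{
    std::uint8_t b0, b1, b2, b3, b4, b5, b6, b7;
};

static_assert(sizeof(S) == 8, "S is expected to fit in one 64-bit register");

// Hypothetical helper: view the whole struct as a single 64-bit value,
// which is what holding it in a general-purpose register amounts to.
std::uint64_t toRegister(const S& s)
{
    std::uint64_t bits;
    std::memcpy(&bits, &s, sizeof(bits));
    return bits;
}

// Hypothetical helper: rebuild the struct from the 64-bit value.
S fromRegister(std::uint64_t bits)
{
    S s;
    std::memcpy(&s, &bits, sizeof(s));
    return s;
}
```

Round-tripping a value through `toRegister` and `fromRegister` preserves every field, which is why the stores to and loads from [rsp+20H] in the old code were not needed.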

@sandreenko sandreenko added the os-windows, arch-x64, area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), and optimization labels Jul 1, 2021
@sandreenko sandreenko added this to the 6.0.0 milestone Jul 1, 2021
@sandreenko sandreenko marked this pull request as ready for review July 2, 2021 00:55
@sandreenko
Contributor Author

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 2 pipeline(s).

@sandreenko
Contributor Author

PTAL @BruceForstall @dotnet/jit-contrib

@sandreenko
Contributor Author

The failures are unrelated. I will also run jitstressregs later, once the failing test is disabled.

@sandreenko
Contributor Author

/azp run runtime-coreclr jitstressregs

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@sandreenko
Contributor Author

/azp run runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@kunalspathak kunalspathak left a comment


LGTM...but I will let Bruce take another look.

@@ -1210,13 +1210,6 @@ void CodeGen::genUnspillRegIfNeeded(GenTree* tree)
assert(!varTypeIsGC(varDsc));
spillType = lclActualType;
}
#elif defined(TARGET_ARM64)
var_types targetType = unspillTree->gtType;
if (spillType != genActualType(varDsc->lvType) && !varTypeIsGC(spillType) && !varDsc->lvNormalizeOnLoad())
Member


nit: could you also update the comment above to say GetActualRegisterType() instead of genActualType?

Contributor Author


sure, what would you like to add there? now it looks like:

//------------------------------------------------------------------------
// GetActualRegisterType: Determine an actual register type for this local var.
//
// Return Value:
// TYP_UNDEF if the layout is not enregistrable, the register type otherwise.
//
var_types LclVarDsc::GetActualRegisterType() const

Member


Sorry - didn't notice your comment.

I was suggesting to update the comment right above where you changed the code:

// Load local variable from its home location.
// In most cases the tree type will indicate the correct type to use for the load.
// However, if it is NOT a normalizeOnLoad lclVar (i.e. NOT a small int that always gets
// widened when loaded into a register), and its size is not the same as genActualType of
// the type of the lclVar, then we need to change the type of the tree node when loading.
// This situation happens due to "optimizations" that avoid a cast and

-and its size is not the same as genActualType of the type of the lclVar,
+and its size is not the same as actual register type,

Contributor Author


thanks, fixed

@tannergooding
Member

What's the plan for structs that contain floating point values or that might be more efficiently enregistered in a SIMD local (e.g. struct MyVector4 { float x, y, z, w; })?

@sandreenko
Contributor Author

> What's the plan for structs that contain floating point values or that might be more efficiently enregistered in a SIMD local (e.g. struct MyVector4 { float x, y, z, w; })?

MyVector4 will be enregistered as SIMD16 because there are no general-purpose registers of that size. The interesting cases are MyVector3 and MyVector2: the first is usually widened to SIMD16, but the second will be enregistered as long.
That is fine as long as we don't pass/return it using floating-point registers, which is one reason this PR is for x64 Windows only: we don't do such things there. On other platforms, enregistration for such cases will be disabled until we fix the code that moves incoming regs to their home locations; right now it does not support moves between register files. After it is fixed, we will think about choosing a better base type for them.
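
The size-based choice described above can be sketched roughly as follows. This is an illustrative simplification, not the JIT's implementation: `RegType` and `pickRegisterType` are hypothetical names, MyVector2 and MyVector3 are assumed to hold two and three floats, and only the three sizes mentioned in this comment are modeled (everything else returns `Undef` here, although the real JIT handles more cases).

```cpp
#include <cstddef>

// Hypothetical stand-in for the JIT's register types.
enum class RegType
{
    Undef,  // not modeled or not enregisterable in this sketch
    Long,   // one 64-bit general-purpose register
    Simd16  // one 16-byte SIMD register
};

// Hypothetical helper: pick a register type from the struct size alone.
RegType pickRegisterType(std::size_t structBytes)
{
    switch (structBytes)
    {
        case 8:  // e.g. MyVector2 { float x, y; }: enregistered as long
            return RegType::Long;
        case 12: // e.g. MyVector3 { float x, y, z; }: widened to SIMD16
        case 16: // e.g. MyVector4 { float x, y, z, w; }: no 16-byte GPR
            return RegType::Simd16;
        default: // sizes not discussed in the comment above
            return RegType::Undef;
    }
}
```

Note that the 8-byte case picks an integer register even though the fields are floats, which is exactly the situation that needs moves between register files on platforms that pass such structs in floating-point registers.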

Member

@BruceForstall BruceForstall left a comment


LGTM
