Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Struct getters are generating unneccessary instructions on x64 when struct contains floats #7200

Closed
GeorchW opened this issue Jan 6, 2017 · 10 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@GeorchW
Copy link

GeorchW commented Jan 6, 2017

Hi folks,

I hope I'm writing this in the right area. I was comparing the JIT outputs of properties vs. fields, to see whether I should avoid auto-properties in high-performance code. Now, I've managed to create an example where the exact same data embedded in a struct creates different output than just using a primitive type. To be exact, it looks like the entire struct is copied once more than neccessary. This only happens when a float is embedded, and not for ints.

The behaviour when using float fields or properties directly seems to be as expected (no additional instructions when using properties).

I'm using VS2017 RC, .NET 4.6.2, compiling to x64, I think it should be RyuJIT, although I don't know how to find that out. The disassemblies were obtained via Alt+G.

Example code:

Disassembly:
With struct TestStruct { public float Value; }

         testClass.Field = testStruct;
 mov         qword ptr [rsp+10h],rdx  
 mov         eax,dword ptr [rsp+10h]  
 mov         dword ptr [rcx+8],eax  
 ret  
 
         testClass.Property = testStruct;
 sub         rsp,8  
 mov         qword ptr [rsp+18h],rdx  
 mov         eax,dword ptr [rsp+18h]  
 mov         dword ptr [rsp],eax  
 mov         eax,dword ptr [rsp]  
 mov         dword ptr [rcx+10h],eax  
 add         rsp,8  
 ret  
 
         Print(testClass.Field);
 sub         rsp,28h  
 mov         ecx,dword ptr [rcx+8]  
 call        00007FFC579A00A0  
 nop  
 add         rsp,28h  
 ret  
 
         Print(testClass.Property);
 sub         rsp,28h  
 mov         ecx,dword ptr [rcx+10h]  
 mov         dword ptr [rsp+20h],ecx  
 mov         ecx,dword ptr [rsp+20h]  
 call        00007FFC579A05A0  
 nop  
 add         rsp,28h  
 ret  

With struct TestStruct { public int Value; }

        testClass.Field = testStruct;
 mov         dword ptr [rcx+8],edx  
 ret  
 
         testClass.Property = testStruct;
 mov         eax,edx  
 mov         dword ptr [rcx+10h],eax  
 ret  
 
         Print(testClass.Field);
 sub         rsp,28h  
 mov         ecx,dword ptr [rcx+8]  
 call        00007FFC579C00A0  
 nop  
 add         rsp,28h  
 ret  
 
         Print(testClass.Property);
 sub         rsp,28h  
 mov         ecx,dword ptr [rcx+10h]  
 call        00007FFC579C0560  
 nop  
 add         rsp,28h  
 ret  

Full source:

using System;
using System.Runtime.CompilerServices;

struct TestStruct
{
    public float Value;
    //public int Value;
}
class TestClass
{
    public TestStruct Field;
    public TestStruct Property { get; set; }
}
class Program
{

    static void Main(string[] args)
    {
        TestClass testClass = new TestClass();
        TestFieldSetters(testClass, new TestStruct() { Value = 5 });
        TestPropSetters(testClass, new TestStruct() { Value = 3 });
        TestFieldGetters(testClass);
        TestPropGetters(testClass);
        Console.ReadLine();
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void TestFieldSetters(TestClass testClass, TestStruct testStruct)
    {
        testClass.Field = testStruct;
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void TestPropSetters(TestClass testClass, TestStruct testStruct)
    {
        testClass.Property = testStruct;
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void TestFieldGetters(TestClass testClass)
    {
        Print(testClass.Field);
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void TestPropGetters(TestClass testClass)
    {
        Print(testClass.Property);
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Print(TestStruct testStruct)
    {
        Console.WriteLine(testStruct.Value);
    }
}

category:cq
theme:structs
skill-level:expert
cost:medium
impact:medium

@GeorchW GeorchW changed the title Struct getters are generating unneccessary instructions on x64 Struct getters are generating unneccessary instructions on x64 when struct conatins floats Jan 6, 2017
@mikedn
Copy link
Contributor

mikedn commented Jan 6, 2017

To be exact, it looks like the entire struct is copied once more than neccessary.

Yep, accessing a field via a property naturally results in a temporary copy and the JIT can't always eliminate that. See #4323 as well.

@GeorchW GeorchW changed the title Struct getters are generating unneccessary instructions on x64 when struct conatins floats Struct getters are generating unneccessary instructions on x64 when struct contains floats Jan 7, 2017
@RussKeldorph
Copy link
Contributor

@dotnet/jit-contrib

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@tannergooding
Copy link
Member

@sandreenko, @CarolEidt. I took a look at this as I found a codegen issue around simple struct wrappers. For example: ValueTuple<double> where the codegen ended up being pretty bad:

      C5FA11442440         vmovss   dword ptr [rsp+40H], xmm0
       8B442440             mov      eax, dword ptr [rsp+40H]
       89442460             mov      dword ptr [rsp+60H], eax
       8B442460             mov      eax, dword ptr [rsp+60H]
       89442438             mov      dword ptr [rsp+38H], eax
       C5F259442438         vmulss   xmm0, xmm1, dword ptr [rsp+38H]
       C5FA11442430         vmovss   dword ptr [rsp+30H], xmm0
       8B442430             mov      eax, dword ptr [rsp+30H]
       89442458             mov      dword ptr [rsp+58H], eax
       8B442458             mov      eax, dword ptr [rsp+58H]
       89442428             mov      dword ptr [rsp+28H], eax
       C5EA58442428         vaddss   xmm0, xmm2, dword ptr [rsp+28H]
       C5FA11442420         vmovss   dword ptr [rsp+20H], xmm0
       8B442420             mov      eax, dword ptr [rsp+20H]
       89442450             mov      dword ptr [rsp+50H], eax
       8B442450             mov      eax, dword ptr [rsp+50H]
       89442418             mov      dword ptr [rsp+18H], eax
       C5FA10442418         vmovss   xmm0, dword ptr [rsp+18H]
       C5FA5EC3             vdivss   xmm0, xmm0, xmm3
       C5FA11442410         vmovss   dword ptr [rsp+10H], xmm0
       8B442410             mov      eax, dword ptr [rsp+10H]
       89442448             mov      dword ptr [rsp+48H], eax
       8B442448             mov      eax, dword ptr [rsp+48H]
       89442408             mov      dword ptr [rsp+08H], eax
       C5FA10442408         vmovss   xmm0, dword ptr [rsp+08H]

We currently block struct promotion of structs with a single floating-point field: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lclvars.cpp#L1910-L1925
This can be trivially resolved by adjusting the check to include compiler->compDoOldStructRetyping() in which case we generate optimal assembly provided everything is inlined:

       C5FA59C1             vmulss   xmm0, xmm0, xmm1
       C5FA58C2             vaddss   xmm0, xmm0, xmm2
       C5FA5EC3             vdivss   xmm0, xmm0, xmm3

For the case where things aren't inlined and where the wrapper struct is either a method return or method parameter then there are at least two remaining issues.

The first is that genFnPrologCalleeRegArgs makes some assumptions that the lclVarDsc type and the method parameter type should line up. It looks to largely account for this on UNIX_AMD64_ABI but not on the other platforms. Additionally, it doesn't look to be super efficient and looks up or recomputes information that is already available in regArgTab in a few places.

  • This is namely impactful when we have an incoming arg that needs to be immediately spilled, such as static Tuple<T1> ToTuple<T1>(this ValueTuple<T1> value) { return Tuple.Create(value.Item1); }
  • I did a very rough fix of this locally which seems to work but it might not cover everything and could likely do with a cleanup to make sure it is done properly

The other issue is that the register allocator doesn't currently expect for floating-point values to be in integer registers so we will get assets in places like: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lsrabuild.cpp#L599 because theInterval->registerType is TYP_FLOAT but mask is ECX

  • This is impactful to methods like static float Test(Simple s) { return s.x; } where we would ideally just generate movd xmm0, eax

The reverse of going from a float to a wrapper type looks to work just fine provided the first two issues are fixed: static Simple Test(float value) { return new Simple { x = value }; } generates C5F97EC0 vmovd eax, xmm0

@sandreenko
Copy link
Contributor

@tannergooding thanks for the analysis, I mostly agree with you conclusions, a few comments

The other issue is that the register allocator doesn't currently expect for floating-point values to be in integer registers

I would not want to retype the struct as its only field and fix it in LSRA, I hope to delete all code under compDoOldStructRetyping() and do not go back to retyping. With #36862 we will have support for:
LCL_VAR struct v01 = CALL() when struct's fields correspond 1<->1 with ABI return registers, for example, when it is a one field struct, so such calls won't block promotion. The next step could be to do the same for arguments, so CALL(LCL_VAR struct v01) doesn't not block promotion for LCL_VAR struct v01 as well and it will solve the issue in my understanding.

@tannergooding
Copy link
Member

@sandreenko, I think I might actually be missing something here....

When we do struct promotion, we create a LclVarDsc for each field of the struct, so it isn't necessarily retyping as the LclVar type matches the field type.
In fact, this all lines up pretty well except for when the parent struct is an argument or return type.

When the parent struct is an argument or return type, then it might start out in a register that doesn't "match" the LclVar type and that is when the oddness starts.
For example, struct S { float x; } is passed in an integer register on Windows x64. Likewise, struct S { float x; int y; } or struct S { int x; float y; } is passed in a single integer register on Unix and Windows x64: https://godbolt.org/z/syGcgt

So, in the above scenario you might have something in ECX, or possibly even the upper bits of RCX, which needs to move to XMM0 and that is when the issues really appear to start.

@CarolEidt
Copy link
Contributor

@tannergooding - the retyping currently happens in order to work around the register file mismatches. But it causes may problems of its own. So once we've got the struct handling the way we'd like it to be, for your static float Test(Simple s) { return s.x; } you would allow struct promotion, and just handle that in the prolog. That is the arg variable goes away and gets replaced with s.x. Then when the prolog is generated, it does the appropriate move between register files. That's the way we currently handle SIMD types that are passed in multiple registers - they're reassembled in the prolog.

Other cases, where you may still have references to the entire struct, those would be handled in Lowering where it would insert the appropriate BITCAST node where needed. E.g. if I had (same struct, and s.x is promoted):

s = call
return s.x

The IR after Lowering would look something like:

Vs.x = BITCAST(call,float)
return Vs.x

The BITCAST does the move from general purpose to floating point regster.

More complex cases (multiple fields that don't match the registers) that require extracting fields, will not be supported right away, but shouldn't be too hard once we get the next couple of rounds of changes in.

@CarolEidt
Copy link
Contributor

While the Program:TestPropSetters(TestClass,TestStruct) case has improved, in all cases the struct is still stored to the stack and reloaded, rather than a simple register file move.

@CarolEidt CarolEidt modified the milestones: Future, 6.0.0 Oct 27, 2020
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@sandreenko sandreenko removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 12, 2021
@sandreenko sandreenko modified the milestones: 6.0.0, 7.0.0 Jul 9, 2021
@AndyAyersMS
Copy link
Member

Suspect we won't get to this in 7.0, so moving to future.

@sandreenko should we unassign you?

@AndyAyersMS AndyAyersMS modified the milestones: 7.0.0, Future Apr 28, 2022
@sandreenko sandreenko removed their assignment Apr 28, 2022
@jakobbotsch
Copy link
Member

Codegen today on main:

; Assembly listing for method Program:TestFieldSetters(TestClass,TestStruct)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd single-def
;  V01 arg1         [V01,T01] (  3,  3   )  struct ( 8) rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M46470_IG01:
						;; size=0 bbWeight=1    PerfScore 0.00
G_M46470_IG02:
       mov      dword ptr [rcx+08H], edx
						;; size=3 bbWeight=1    PerfScore 1.00
G_M46470_IG03:
       ret      
						;; size=1 bbWeight=1    PerfScore 1.00

; Total bytes of code 4, prolog size 0, PerfScore 2.40, instruction count 2, allocated bytes for code 4 (MethodHash=8e384a79) for method Program:TestFieldSetters(TestClass,TestStruct)
; ============================================================
; Assembly listing for method Program:TestPropSetters(TestClass,TestStruct)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd single-def
;  V01 arg1         [V01,T01] (  3,  3   )  struct ( 8) rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+00H]   "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )  struct ( 8) zero-ref    "Inlining Arg"
;
; Lcl frame size = 0

G_M59705_IG01:
						;; size=0 bbWeight=1    PerfScore 0.00
G_M59705_IG02:
       mov      dword ptr [rcx+0CH], edx
						;; size=3 bbWeight=1    PerfScore 1.00
G_M59705_IG03:
       ret      
						;; size=1 bbWeight=1    PerfScore 1.00

; Total bytes of code 4, prolog size 0, PerfScore 2.40, instruction count 2, allocated bytes for code 4 (MethodHash=ccab16c6) for method Program:TestPropSetters(TestClass,TestStruct)
; ============================================================
; Assembly listing for method Program:TestFieldGetters(TestClass)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 40

G_M35359_IG01:
       sub      rsp, 40
						;; size=4 bbWeight=1    PerfScore 0.25
G_M35359_IG02:
       mov      ecx, dword ptr [rcx+08H]
       call     [Program:Print(TestStruct)]
       nop      
						;; size=10 bbWeight=1    PerfScore 5.25
G_M35359_IG03:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1    PerfScore 1.25

; Total bytes of code 19, prolog size 4, PerfScore 8.65, instruction count 6, allocated bytes for code 19 (MethodHash=59bc75e0) for method Program:TestFieldGetters(TestClass)
; ============================================================
; Assembly listing for method Program:TestPropGetters(TestClass)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd single-def
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )  struct ( 8) [rsp+20H]   do-not-enreg[S] "struct address for call/obj"
;
; Lcl frame size = 40

G_M21632_IG01:
       sub      rsp, 40
						;; size=4 bbWeight=1    PerfScore 0.25
G_M21632_IG02:
       mov      ecx, dword ptr [rcx+0CH]
       mov      dword ptr [rsp+20H], ecx
       mov      ecx, dword ptr [rsp+20H]
       call     [Program:Print(TestStruct)]
       nop      
						;; size=18 bbWeight=1    PerfScore 7.25
G_M21632_IG03:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1    PerfScore 1.25

; Total bytes of code 27, prolog size 4, PerfScore 11.45, instruction count 8, allocated bytes for code 27 (MethodHash=a0acab7f) for method Program:TestPropGetters(TestClass)
; ============================================================

Only TestPropGetters looks to have an unnecessary copy now.

@jakobbotsch
Copy link
Member

The above last case was (indirectly) fixed by dad39c7 (and we also started promoting structs wrapping floats in #84627).

@ghost ghost locked as resolved and limited conversation to collaborators Jul 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

10 participants