Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] __m128 should be returned in XMM0 #1230

Open
wants to merge 8 commits into
base: master
from

Conversation

@cshung
Copy link
Member

cshung commented Dec 31, 2019

This is my attempt to hack the JIT so that it honors the Windows x64 calling convention in case an instance of Vector128<T> is returned.

It is currently being returned through a buffer provided by the caller, but it should be returned through XMM0 instead.

This should fix dotnet/coreclr#27002 from the JIT's perspective, which is part of dotnet/coreclr#15943 that is blocking #226 significantly.

This work is very early with very limited testing, therefore this change is sent out mostly to solicit feedback, please do not merge.

  • Make sure the code is under _TARGET_WINDOWS_ as well to fix cross-plat build/test failures.
  • Fix the reflection invoke scenario.
  • Make sure tests are clean
  • Confirm and ensure TYP_SIMD32 is returned through YMM0.

Test status:
On Windows, x64, debug, these test cases are failing.

Interop\COM\NETClients\IDispatch\NETClientIDispatch\NETClientIDispatch.cmd
JIT\HardwareIntrinsics\X86\Avx2\GatherVector128_r\GatherVector128_ro.cmd
JIT\HardwareIntrinsics\X86\Avx2\GatherVector128_r\GatherVector128_r.cmd

Here is the sample C# program.

using System;
using System.Runtime.Intrinsics;

namespace CoreLab
{
    class Program
    {
        static void Main(string[] args)
        {
            Caller();
        }

        static void Caller()
        {
            Callee();
        }

        static Vector128<double> Callee()
        {
            return default(Vector128<double>);
        }
    }
}

Before change

Full JIT dump is available here

Caller

IN0009: 000000 55                   push     rbp
IN000a: 000001 4883EC30             sub      rsp, 48
IN000b: 000005 488D6C2430           lea      rbp, [rsp+30H]
IN0001: 00000A 833D9F6D1F0000       cmp      dword ptr [(reloc 0x7fff50fbeaf0)], 0
IN0002: 000011 7405                 je       SHORT G_M17050_IG04
IN0003: 000013 E8E8B2A15E           call     CORINFO_HELP_DBG_IS_JUST_MY_CODE
IN0004: 000018 90                   nop      
IN0005: 000019 488D4DF0             lea      rcx, [rbp-10H]
IN0006: 00001D E87694FFFF           call     CoreLab.Program:Callee():System.Runtime.Intrinsics.Vector128`1[Double]
IN0007: 000022 90                   nop      
IN0008: 000023 90                   nop      
IN000c: 000024 488D6500             lea      rsp, [rbp]
IN000d: 000028 5D                   pop      rbp
IN000e: 000029 C3                   ret      

Callee

IN000f: 000000 55                   push     rbp
IN0010: 000001 4883EC40             sub      rsp, 64
IN0011: 000005 C5F877               vzeroupper 
IN0012: 000008 488D6C2440           lea      rbp, [rsp+40H]
IN0013: 00000D 33C0                 xor      rax, rax
IN0014: 00000F 488945F0             mov      qword ptr [rbp-10H], rax
IN0015: 000013 488945F8             mov      qword ptr [rbp-08H], rax
IN0016: 000017 488945E0             mov      qword ptr [rbp-20H], rax
IN0017: 00001B 488945E8             mov      qword ptr [rbp-18H], rax
IN0018: 00001F 48894D10             mov      bword ptr [rbp+10H], rcx
IN0001: 000023 833D466D1F0000       cmp      dword ptr [(reloc 0x7fff50fbeaf0)], 0
IN0002: 00002A 7405                 je       SHORT G_M60652_IG04
IN0003: 00002C E88FB2A15E           call     CORINFO_HELP_DBG_IS_JUST_MY_CODE
IN0004: 000031 90                   nop      
IN0005: 000032 C5F857C0             vxorps   xmm0, xmm0 (ECS:5, ACS:4)
IN0006: 000036 C5F92945F0           vmovapd  xmmword ptr [rbp-10H], xmm0 (ECS:6, ACS:5)
IN0007: 00003B C5F92845F0           vmovapd  xmm0, xmmword ptr [rbp-10H] (ECS:6, ACS:5)
IN0008: 000040 C5F92945E0           vmovapd  xmmword ptr [rbp-20H], xmm0 (ECS:6, ACS:5)
IN0009: 000045 90                   nop      
IN000a: 000046 EB04                 jmp      SHORT G_M60652_IG05
IN000b: 000048 488B4510             mov      rax, bword ptr [rbp+10H]
IN000c: 00004C C5F92845E0           vmovapd  xmm0, xmmword ptr [rbp-20H] (ECS:6, ACS:5)
IN000d: 000051 C5F91100             vmovupd  xmmword ptr [rax], xmm0 (ECS:5, ACS:4)
IN000e: 000055 488B4510             mov      rax, bword ptr [rbp+10H]
IN0019: 000059 488D6500             lea      rsp, [rbp]
IN001a: 00005D 5D                   pop      rbp
IN001b: 00005E C3                   ret      

After change

Full JIT dump is available here

Caller

IN000a: 000000 55                   push     rbp
IN000b: 000001 4883EC30             sub      rsp, 48
IN000c: 000005 488D6C2430           lea      rbp, [rsp+30H]
IN0001: 00000A 833DDF6D1F0000       cmp      dword ptr [(reloc 0x7fff4561eaf0)], 0
IN0002: 000011 7405                 je       SHORT G_M17050_IG04
IN0003: 000013 E828B3A15E           call     CORINFO_HELP_DBG_IS_JUST_MY_CODE
IN0004: 000018 90                   nop      
IN0005: 000019 E8BA94FFFF           call     CoreLab.Program:Callee():System.Runtime.Intrinsics.Vector128`1[Double]
IN0006: 00001E C5F828C0             vmovaps  xmm0, xrax (ECS:5, ACS:4)
IN0007: 000022 C5F92945F0           vmovapd  xmmword ptr [rbp-10H], xmm0 (ECS:6, ACS:5)
IN0008: 000027 90                   nop      
IN0009: 000028 90                   nop      
IN000d: 000029 488D6500             lea      rsp, [rbp]
IN000e: 00002D 5D                   pop      rbp
IN000f: 00002E C3                   ret      

Callee

IN000c: 000000 55                   push     rbp
IN000d: 000001 4883EC40             sub      rsp, 64
IN000e: 000005 C5F877               vzeroupper 
IN000f: 000008 488D6C2440           lea      rbp, [rsp+40H]
IN0010: 00000D 33C0                 xor      rax, rax
IN0011: 00000F 488945F0             mov      qword ptr [rbp-10H], rax
IN0012: 000013 488945F8             mov      qword ptr [rbp-08H], rax
IN0013: 000017 488945E0             mov      qword ptr [rbp-20H], rax
IN0014: 00001B 488945E8             mov      qword ptr [rbp-18H], rax
IN0001: 00001F 833D7A6D1F0000       cmp      dword ptr [(reloc 0x7fff4561eaf0)], 0
IN0002: 000026 7405                 je       SHORT G_M60652_IG04
IN0003: 000028 E8C3B2A15E           call     CORINFO_HELP_DBG_IS_JUST_MY_CODE
IN0004: 00002D 90                   nop      
IN0005: 00002E C5F857C0             vxorps   xmm0, xmm0 (ECS:5, ACS:4)
IN0006: 000032 C5F92945F0           vmovapd  xmmword ptr [rbp-10H], xmm0 (ECS:6, ACS:5)
IN0007: 000037 C5F92845F0           vmovapd  xmm0, xmmword ptr [rbp-10H] (ECS:6, ACS:5)
IN0008: 00003C C5F92945E0           vmovapd  xmmword ptr [rbp-20H], xmm0 (ECS:6, ACS:5)
IN0009: 000041 90                   nop      
IN000a: 000042 EB04                 jmp      SHORT G_M60652_IG05
IN000b: 000044 C5F92845E0           vmovapd  xmm0, xmmword ptr [rbp-20H] (ECS:6, ACS:5)
IN0015: 000049 488D6500             lea      rsp, [rbp]
IN0016: 00004D 5D                   pop      rbp
IN0017: 00004E C3                   ret      
@@ -960,6 +960,16 @@ var_types Compiler::getReturnTypeForStruct(CORINFO_CLASS_HANDLE clsHnd,
// The largest "primitive type" is MAX_PASS_SINGLEREG_BYTES
// so we can skip calling getPrimitiveTypeForStruct when we
// have a struct that is larger than that.

#ifdef _TARGET_AMD64_
if ((impNormStructType(clsHnd) == TYP_SIMD16))

This comment has been minimized.

Copy link
@CarolEidt

CarolEidt Jan 1, 2020

Member

This should probably be handled in getPrimitiveTypeForStruct but requires some work there, because IIRC windows and linux have different ABI requirements for args (and this same method is used for both).

This comment has been minimized.

Copy link
@mikedn

mikedn Jan 3, 2020

Collaborator

Maybe I'm missing something but won't this simple TYP_SIMD16 check result in Vector4 being returned in xmm0 instead of the ret buffer?

This comment has been minimized.

Copy link
@CarolEidt

CarolEidt Jan 3, 2020

Member

Yes, good point @mikedn, but I think that requires verification - my recollection is that the float4 struct in Visual C++ (which Vector4 was designed to be compatible with) is passed as a "regular" struct of 16 bytes, but I'm not entirely certain of that.

This comment has been minimized.

Copy link
@mikedn

mikedn Jan 3, 2020

Collaborator

I'm not sure if you have a particular float4 struct in mind but VC++ does return a normal 4 float struct via the return buffer:
https://gcc.godbolt.org/z/nSrf85

It returns it in XMM0 only if the vectorcall calling convention is used:
https://gcc.godbolt.org/z/HiKuxD

This comment has been minimized.

Copy link
@tannergooding

tannergooding Jan 3, 2020

Member

I believe, under previous discussions, we had said Vector4 was to be a user-defined struct of 4 floats, which would make it at best an HFA (being passed in 4 registers under __vectorcall and 2 registers under SysV)

This comment has been minimized.

Copy link
@tannergooding

tannergooding Jan 3, 2020

Member

This was also needed for back-compat as Vector2/3/4 aren't generic and so have been usable in P/Invoke declarations since first implemented (and so the expecation is it is treated as a user-defined struct).

This comment has been minimized.

Copy link
@tannergooding

tannergooding Jan 3, 2020

Member

It returns it in XMM0 only if the vectorcall calling convention is used:

It's worth noting that its passed/returned in 4 registers; not just xmm0:
https://gcc.godbolt.org/z/YQ_Hgd

This comment has been minimized.

Copy link
@mikedn

mikedn Jan 3, 2020

Collaborator

Right, I keep forgetting that while such structs are given special status they're not really treated as a single vector. That will only make implementing vectorcall in the JIT more complicated...

This comment has been minimized.

Copy link
@CarolEidt

CarolEidt Jan 3, 2020

Member

True, though we've already got HFAs for arm & arm64, so some of the hard work is already done. Maybe we can hope for / actually achieve some cleanup of ABI handling that can actually leverage that.

src/coreclr/src/jit/codegenxarch.cpp Outdated Show resolved Hide resolved
@@ -166,7 +166,8 @@ void CodeGen::genEmitGSCookieCheck(bool pushReg)
// returned in implicit RetBuf. If we reached here, we should not have
// a RetBuf and the return type should not be a struct.
assert(compiler->info.compRetBuffArg == BAD_VAR_NUM);
assert(!varTypeIsStruct(compiler->info.compRetNativeType));
// __m128 should be returned in XMM0

This comment has been minimized.

Copy link
@CarolEidt

CarolEidt Jan 1, 2020

Member

The comment above should be changed, rather than simply adding on a comment about __m128. Also, that is the name of the native type, and I think it's clearer to use the JIT name for the type in the comments (i.e. TYP_SIMD16) for consistency.

@cshung cshung closed this Jan 4, 2020
@cshung cshung reopened this Jan 4, 2020
@cshung cshung changed the title [WIP] __m128 should be returned in XMM0 __m128 should be returned in XMM0 Jan 4, 2020
@cshung cshung closed this Jan 4, 2020
@cshung cshung reopened this Jan 4, 2020
@cshung cshung changed the title __m128 should be returned in XMM0 [WIP] __m128 should be returned in XMM0 Jan 4, 2020
@cshung cshung added this to In progress in Hardware Intrinsics via automation Jan 5, 2020
@cshung cshung removed this from In progress in Hardware Intrinsics Jan 5, 2020
@cshung cshung force-pushed the cshung:dev/andrewau/typ-simd16-amd64-return branch 2 times, most recently from 9575e9c to 5a6d260 Jan 6, 2020
@cshung cshung force-pushed the cshung:dev/andrewau/typ-simd16-amd64-return branch from 5a6d260 to cdaa12c Jan 11, 2020
@cshung

This comment has been minimized.

Copy link
Member Author

cshung commented Jan 13, 2020

At this point, the PR is down to 3 test failures:

Interop\COM\NETClients\IDispatch\NETClientIDispatch\NETClientIDispatch.cmd
JIT\HardwareIntrinsics\X86\Avx2\GatherVector128_r\GatherVector128_ro.cmd
JIT\HardwareIntrinsics\X86\Avx2\GatherVector128_r\GatherVector128_r.cmd

The IDispatch test case is a mystery. I am able to repro the test failure locally, but I have a hard time understand how my change could impact that scenario. Vector128<T> is used in the test case for only string processing, but the test case that failed is about plain floating-point variables and returns. I am sure the native code got the right floating values as input and produced the right output, but somehow the return value is gone when it gets back to managed.

I couldn't repro the GatherVector128 test failures. I am sure my machine has AVX2, optimization is turned on for the libraries and tiered compilation is turned off. It just doesn't repro locally.

Some help to move forward is highly appreciated.

@cshung cshung force-pushed the cshung:dev/andrewau/typ-simd16-amd64-return branch 2 times, most recently from 011f0c8 to 0fc6bd2 Jan 18, 2020
@cshung cshung force-pushed the cshung:dev/andrewau/typ-simd16-amd64-return branch from 0fc6bd2 to 015dc8b Jan 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.