Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Vector.Ceiling / Vector.Floor #31993

Open
wants to merge 11 commits into
base: master
from

Conversation

@Gnbrkm41
Copy link
Collaborator

Gnbrkm41 commented Feb 9, 2020

Fixes #20509

Utilises ROUNDPS/ROUNDPD on x64, FRINTP/FRINTM on ARM64.
Example Code:

using System;
using System.Numerics;

public static class Program
{
    private static void Main()
    {
        Vector<float> vec = new Vector<float>(4.5f);
        vec = Vector.Ceiling(vec);
        Console.WriteLine(vec[0]);
    }
}

Assembly generated: (W10 Pro Insiders 19559 x64, Intel i7-8700)

*************** After end code gen, before unwindEmit()
G_M27646_IG01:        ; func=00, offs=000000H, size=001FH, bbWeight=1    PerfScore 7.00, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN0009: 000000 push     rbp
IN000a: 000001 sub      rsp, 80
IN000b: 000005 vzeroupper
IN000c: 000008 lea      rbp, [rsp+50H]
IN000d: 00000D xor      rax, rax
IN000e: 00000F mov      qword ptr [V00 rbp-30H], rax
IN000f: 000013 mov      qword ptr [V00+0x8 rbp-28H], rax
IN0010: 000017 mov      qword ptr [V00+0x10 rbp-20H], rax
IN0011: 00001B mov      qword ptr [V00+0x18 rbp-18H], rax

G_M27646_IG02:        ; offs=00001FH, size=0029H, bbWeight=1    PerfScore 21.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, isz

IN0001: 00001F vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000028 vmovupd  ymmword ptr[V00 rbp-30H], ymm0
IN0003: 00002D vmovupd  ymm0, ymmword ptr[V00 rbp-30H]
IN0004: 000032 vroundps ymm0, ymm0, 10
IN0005: 000038 vmovupd  ymmword ptr[V00 rbp-30H], ymm0
IN0006: 00003D vmovss   xmm0, dword ptr [rbp-30H]
IN0007: 000042 call     System.Console:WriteLine(float)
IN0008: 000047 nop

G_M27646_IG03:        ; offs=000048H, size=0009H, bbWeight=1    PerfScore 3.00, epilog, nogc, extend

IN0012: 000048 vzeroupper
IN0013: 00004B lea      rsp, [rbp]
IN0014: 00004F pop      rbp
IN0015: 000050 ret

cc @tannergooding, @danmosemsft

Gnbrkm41 added 5 commits Feb 9, 2020
* Also fix the old path for the new structure
@Gnbrkm41

This comment has been minimized.

Copy link
Collaborator Author

Gnbrkm41 commented Feb 9, 2020

Currently, if COMPlus_EnableSse41=0 the assertion in lsraxarch.cpp L1975 compiler->getSIMDSupportLevel() >= SIMD_SSE4_Supported fails. Need to make it not treat as intrinsic if we don't have SSE4.1...

@Gnbrkm41 Gnbrkm41 changed the title Implement Vector.Ceiling / Vector.Floor [WIP] Implement Vector.Ceiling / Vector.Floor Feb 9, 2020
@Gnbrkm41

This comment has been minimized.

Copy link
Collaborator Author

Gnbrkm41 commented Feb 9, 2020

Any clues on what ##[error]Failed to build "CoreCLR component". is about for *nix builds? 🙄

@EgorBo

This comment has been minimized.

Copy link
Contributor

EgorBo commented Feb 9, 2020

I wonder if it makes sense to implement new Vector`1 APIs using System.Runtime.Intrinsics API in C# instead. So Mono will get these methods accelerated for free 🙂 cc @tannergooding

@Gnbrkm41

This comment has been minimized.

Copy link
Collaborator Author

Gnbrkm41 commented Feb 9, 2020

I wonder if it makes sense to implement new Vector`1 APIs using System.Runtime.Intrinsics API in C#

That would be very, very nice (as I'm struggling to implement the JIT intrinsics part) but I don't know if there was any progression on it; #952 have been implemented by dotnet/coreclr#27401 so it should be doable at this point.

#define ROUNDPS_TO_NEAREST_IMM 0b1000;
#define ROUNDPS_TOWARD_NEGATIVE_INFINITY_IMM 0b1001;
#define ROUNDPS_TOWARD_POSITIVE_INFINITY_IMM 0b1010;
#define ROUNDPS_TOWARD_ZERO_IMM 0b1011;

This comment has been minimized.

Copy link
@EgorBo

EgorBo Feb 9, 2020

Contributor

I guess now these magic numbers can be replaced with these named constants

This comment has been minimized.

Copy link
@Gnbrkm41

Gnbrkm41 Feb 9, 2020

Author Collaborator

Where do you think those macros should go to, in order for those constants to be shared across simdcodegenxarch.cpp and codegenxarch.cpp? As someone that hadn't done a lot of C++ I'm not sure if it can go to codegen.h or if there's a better place.

@Gnbrkm41

This comment has been minimized.

Copy link
Collaborator Author

Gnbrkm41 commented Feb 9, 2020

Changed to managed implementation w/ HWIntrinsics, and it looks quite promising:

*************** After end code gen, before unwindEmit()
G_M27646_IG01:        ; func=00, offs=000000H, size=0007H, bbWeight=1    PerfScore 1.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN0005: 000000 sub      rsp, 40
IN0006: 000004 vzeroupper

G_M27646_IG02:        ; offs=000007H, size=0015H, bbWeight=1    PerfScore 12.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN0001: 000007 vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000010 vroundps ymm0, ymm0, 10
IN0003: 000016 call     System.Console:WriteLine(float)
IN0004: 00001B nop

G_M27646_IG03:        ; offs=00001CH, size=0008H, bbWeight=1    PerfScore 2.25, epilog, nogc, extend

IN0007: 00001C vzeroupper
IN0008: 00001F add      rsp, 40
IN0009: 000023 ret

Here's what the JIT intrinsics version look on checked build:

*************** After end code gen, before unwindEmit()
G_M27646_IG01:        ; func=00, offs=000000H, size=0007H, bbWeight=1    PerfScore 1.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG

IN0005: 000000 sub      rsp, 40
IN0006: 000004 vzeroupper

G_M27646_IG02:        ; offs=000007H, size=0015H, bbWeight=1    PerfScore 12.25, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref

IN0001: 000007 vbroadcastss ymm0, ymmword ptr[reloc @RWD00]
IN0002: 000010 vroundps ymm0, ymm0, 10
IN0003: 000016 call     System.Console:WriteLine(float)
IN0004: 00001B nop

G_M27646_IG03:        ; offs=00001CH, size=0008H, bbWeight=1    PerfScore 2.25, epilog, nogc, extend

IN0007: 00001C vzeroupper
IN0008: 00001F add      rsp, 40
IN0009: 000023 ret

So it actually ends up generating identical code.

@Gnbrkm41

This comment has been minimized.

Copy link
Collaborator Author

Gnbrkm41 commented Feb 9, 2020

One concern I have regarding managed implementation at the moment is that we can't quite utilise SIMD instructions this way on ARM machines, since most of the ARM intrinsics aren't exposed yet. (Most? at least FRINTP/FRINTM does not appear to have been exposed) Otherwise, IMO managed implementations are ways better due to readability / lack of complexity by having it in JIT layer.

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Feb 9, 2020

I wonder if it makes sense to implement new Vector`1 APIs using System.Runtime.Intrinsics API in C# instead. So Mono will get these methods accelerated for free

I had previously attempted this for some methods in: dotnet/coreclr#27483. @jkotas, @CarolEidt, and @AndyAyersMS had raised concerns about there being more complex trees and various getting missed because of this.
I know there have been a few improvements to inlining since then, but they would still need to give sign-off before I would feel comfortable giving the go ahead.

One concern I have regarding managed implementation at the moment is that we can't quite utilise SIMD instructions this way on ARM machines, since most of the ARM intrinsics aren't exposed yet.

The ARM intrinsics are still a WIP and we haven't started the general porting of existing x86 code paths to have ARM equivalents yet. It is fine, IMO, to just accelerate x86 right now provided we have a tracking issue to also accelerate ARM64 once support for the relevant instructions is added.

@Gnbrkm41

This comment has been minimized.

Copy link
Collaborator Author

Gnbrkm41 commented Feb 9, 2020

If we decide to implement in the JIT, how should it be handled for machines without SSE4.1? I'm not really seeing a good way to vectorise this with a set of other SIMD instructions.

@EgorBo

This comment has been minimized.

Copy link
Contributor

EgorBo commented Feb 9, 2020

If we decide to implement in the JIT, how should it be handled for machines without SSE4.1? I'm not really seeing a good way to vectorise this with a set of other SIMD instructions.

simply call Math.Ceiling/Math.Floor for each component?

I had previously attempted this for some methods in: dotnet/coreclr#27483. @jkotas, @CarolEidt, and @AndyAyersMS had raised concerns about there being more complex trees and various getting missed because of this.

I meant only for new members (for a start).

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Feb 9, 2020

If we decide to implement in the JIT, how should it be handled for machines without SSE4.1?

It should likely just be done in software. I would not feel comfortable codifying the SSE2 fallback for vectorized Ceil/Floor in the JIT.

The scalar implementation of Ceil/Floor for x64 Windows is essentially:

For pre-SSE4.1 hardware, you would need to essentially take this algorithm and vectorize it (which, at first glance, should just involve a couple of masks and or'ing the results of the success/failure code paths together).

That being said, I'm also not particularly concerned about that code path. SSE4.1 came out in ~2007 and so is nearly 13 years old. Even the x86 emulation layer in Windows 10 on ARM supports these instructions. Additionally, for S.P.Corelib with R2R and some AOT scenarios there is support for the SSE4.1 code path via a runtime check against a cached CPUID value; so the non-SSE4.1 codepath won't be hit in that scenario either. (Happy to hear input from others if there is some use-case/scenario I am missing).

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Feb 9, 2020

I meant only for new members (for a start).

I think its the same in either scenario. There are many benefits to having the implementations in managed code; including lowering the entry barrier, faster iteration, improved codegen in several scenarios, and easier sharing with other runtimes but there are some drawbacks with the current JIT that might make it unacceptable, even if only done for a few methods initially.

We'd need sign-off from the people I tagged above before we could start taking that route.

Gnbrkm41 added 2 commits Feb 9, 2020
@Gnbrkm41 Gnbrkm41 changed the title [WIP] Implement Vector.Ceiling / Vector.Floor Implement Vector.Ceiling / Vector.Floor Feb 9, 2020
@Gnbrkm41

This comment has been minimized.

Copy link
Collaborator Author

Gnbrkm41 commented Feb 9, 2020

Formatting Linux x64 is failing, but the artifacts generated is 0 bytes? 👀

Gnbrkm41 added 2 commits Feb 9, 2020
* Oops
@Gnbrkm41

This comment has been minimized.

Copy link
Collaborator Author

Gnbrkm41 commented Feb 9, 2020

....I definitely remember Visual Studio complaining about the #endif missing something at the end, but looks like it shouldn't be there. 😳

* Oops, again
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

5 participants
You can’t perform that action at this time.