Investigate moving the types in `System.Numerics.Vectors` to use Hardware Intrinsics #31425

Open
tannergooding opened this Issue Jul 27, 2018 · 39 comments

Comments

7 participants
@tannergooding
Member

tannergooding commented Jul 27, 2018

Today, hardware acceleration of certain types in the System.Numerics.Vectors project is achieved via [Intrinsic] attributes and corresponding runtime support. There are a few downsides to this approach:

  • Minor tweaks to the backing implementation require shipping a new runtime
  • It is not obvious that the code has a hardware accelerated path (outside reading documentation)
  • Many of the types (such as the Matrix types) are not directly hardware accelerated

In netcoreapp3.0, the new Hardware Intrinsics feature is supposed to ship. This feature also allows hardware acceleration, but at a much more fine-grained level (APIs typically have a 1-to-1 mapping with underlying instructions).

We should investigate moving the types in System.Numerics.Vectors to use hardware intrinsics as this has multiple potential benefits:

  • The hardware acceleration is still tied to the runtime, but minor tweaks can be made without having to ship a new runtime
  • The code having a hardware accelerated path becomes obvious, as does the code that will be generated for a given platform/CPU (see the sketch below)
  • It will become much easier to add hardware acceleration support to types currently missing it (such as Matrix4x4)
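
As a rough illustration (not the actual corefx code, and written against the System.Runtime.Intrinsics.X86 preview API surface, which may still change), a single accelerated operation would look something like this, with the scalar code kept as the software fallback:

using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch only: the accelerated path is visible in managed code behind an IsSupported check.
public static unsafe Vector4 Add(Vector4 left, Vector4 right)
{
    if (Sse.IsSupported)
    {
        Vector4 result;

        // Vector4 is four floats, so it round-trips through a single xmm register.
        Sse.Store((float*)&result, Sse.Add(Sse.LoadVector128((float*)&left), Sse.LoadVector128((float*)&right)));

        return result;
    }

    // Software fallback, used on other architectures or older hardware.
    return new Vector4(left.X + right.X, left.Y + right.Y, left.Z + right.Z, left.W + right.W);
}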
@tannergooding

Member

tannergooding commented Jul 27, 2018

Related to #17214

@tannergooding

Member

tannergooding commented Jul 27, 2018

I had done some basic/preliminary investigation some time back and the numbers looked promising: #25386 (comment)

@tannergooding

Member

tannergooding commented Jul 27, 2018

I am marking this up for grabs in case someone from the community wants to work on this. If you want to take a stab, feel free to say so and I can assign the issue out.

The "Microsoft DirectX Math" library is open-source, MIT-licensed, and has existing HWIntrinsic accelerated code for many of these APIs (but for Native code): https://github.com/microsoft/directxmath. This would likely be a good starting point as the algorithms have been around for a long time, have been in active use in production code, and are known to generally be correct and performant.

  • NOTE: In some cases the C# algorithm and the DirectX algorithm differ slightly; we should make note of these cases and determine the best course of action.
@tannergooding

Member

tannergooding commented Jul 27, 2018

Also CC. @benaadams, who I know has looked at vectorizing/improving some of these types previously

@danmosemsft

Member

danmosemsft commented Jul 27, 2018

@hughbe in case he is interested in a meaty task.

@fiigii

Contributor

fiigii commented Jul 27, 2018

@tannergooding Thanks for opening this issue. I agree that porting S.N.Vectors to managed code using HW intrinsics would be much easier to maintain. But before that, we may need to address some differences between Vector128/256<T> and Vector2/3/4/Vector<T>. For example:

  1. We should have an internal API to convert Vector2/3/4/Vector<T> from/to Vector128/256<T>
  2. Vector128/256<T> are immutable types, but Vector3/4 are mutable (fields of Vector3/4 can be assigned). How can we solve this difference? Or should the implementation of some operations remain in the runtime?
@fiigii

Contributor

fiigii commented Jul 27, 2018

The "Microsoft DirectX Math" library is open-source,

@tannergooding Would you like to propose a "VectorMath" library for Vector128/256<T>?

@4creators

Contributor

4creators commented Jul 27, 2018

IMHO we should take the opportunity when rewriting Vector<T> implementation to:

  1. Use the almost complete implementation of Intel HW intrinsics (up to AVX2) to expand the Vector API surface with several missing operations that were requested by the community or became possible due to the HW intrinsics implementation (this would require an API proposal, discussion, and approval)

  2. Provide Arm64 HW intrinsics support where possible

cc @CarolEidt @danmosemsft @eerhardt @fiigii @tannergooding

@tannergooding

Member

tannergooding commented Jul 27, 2018

We should have an internal API to convert Vector2/3/4/Vector<T> from/to Vector128/256<T>

I don't believe this is needed. You should be able to load a Vector2/3/4 into a Vector128<T>, perform all operations, and then convert back to the return type trivially and without much (if any) overhead.

Users will also have their own vector/numeric types, so we should focus on making sure that getting a Vector128<T> from a user defined struct is fast/efficient in general.

Vector128/256<T> are immutable types, but Vector3/4 are mutable (fields of Vector3/4 can be assigned). How can we solve this difference? Or should the implementation of some operations remain in the runtime?

I'm not sure why you think this is an issue. Could you elaborate?

@tannergooding

Member

tannergooding commented Jul 27, 2018

@tannergooding Would you like to propose a "VectorMath" library for Vector128/256?

There will likely need to be some minimal helper library that operates solely on Vector128<T> that the Vector2/3/4 code would call into. This would also be useful for other places in the framework that could take advantage of this.

It would need to be internal at first, and discussions on making it public could happen later.
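
To make that concrete, here is a minimal sketch of what such a helper could look like (the VectorMath and DotProduct names are purely illustrative, not a proposal, and the intrinsics are the preview System.Runtime.Intrinsics.X86 surface, which may still change):

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Illustrative only: a helper operating purely on Vector128<float>, so the same code
// can back the Vector2/3/4 Dot, Length, LengthSquared, Normalize, etc. implementations.
// For Vector2/3 the unused upper lanes are assumed to be zeroed by the caller.
internal static class VectorMath
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> DotProduct(Vector128<float> left, Vector128<float> right)
    {
        Vector128<float> tmp = Sse.Multiply(left, right);   // (x1*x2, y1*y2, z1*z2, w1*w2)
        tmp = Sse.Add(tmp, Sse.Shuffle(tmp, tmp, 0x4E));    // pairwise sums: (x+z, y+w, z+x, w+y)
        return Sse.Add(tmp, Sse.Shuffle(tmp, tmp, 0xB1));   // every lane now holds x+y+z+w
    }
}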

@tannergooding

Member

tannergooding commented Jul 27, 2018

IMHO we should take the opportunity when rewriting Vector<T> implementation to:

Any port should be a two/three step process:

  1. Do a clean port of the existing code and make sure it is equally performant
  2. Do a rewrite of the existing APIs to make improvements, where possible
  3. Add new APIs, as required, and implement accordingly

Doing them together makes it harder to review the code and track the changes.

ARM/ARM64 support should definitely happen, but only after the APIs have been reviewed/approved/implemented.

@fiigii

Contributor

fiigii commented Jul 27, 2018

should be able to load a Vector2/3/4 into a Vector128

That would be great if the runtime can eliminate the round trip. If not, the memory load/store may have considerable overhead when we have __vectorcall in the future.

Users will also have their own vector/numeric types, so we should focus on making sure that getting a Vector128 from a user defined struct is fast/efficient in general.

Agree! Related to dotnet/coreclr#19116

@tannergooding

Member

tannergooding commented Jul 27, 2018

That would be great if the runtime can eliminate the round trip.

I would certainly hope that we can.

From my view, anything that would prevent us from moving the System.Runtime.Vector project to use exclusively HWIntrinsics (outside the dynamic Vector<T> sizing feature) is a potential perf problem for users wanting to use HWIntrinsics themselves. So we want to get those types of things flagged, investigated, fixed, etc.

@saucecontrol

Member

saucecontrol commented Aug 2, 2018

anything that would prevent us from moving the System.Runtime.Vector project to use exclusively HWIntrinsics (outside the dynamic Vector sizing feature) is a potential perf problem for users wanting to use HWIntrinsics themselves

This seems like a good enough excuse for me to take this on if nobody else is interested 😉

I'm at a bit of a loss how to start, though. Even something as trivial as operator + on Vector2 is twisting my brain. As an intrinsic, the JIT knows whether the struct is currently in an xmm register and can emit just an addps. If it's not in a register, it loads it with movsd before emitting the addps. Assuming I pin this and take its address so I can explicitly load it with Sse2.LoadScalarVector128(double*), is the JIT containment logic smart enough to know it's already in an xmm register and skip that part? And would it do the same with Vector3, where the load process is movsd + movss + insertps?

The constructors are even more confusing to me since they may be a setup for further SIMD operations or the fields may be immediately accessed. The JIT seems to know what to do about that now, but I can't imagine how managed code could get that right.

As far as the porting process, would the goal be to replace all [intrinsic] methods first and then SIMD-ize the things that aren't using intrinsics today?

@tannergooding

Member

tannergooding commented Aug 2, 2018

@saucecontrol, my initial experimentation can be found here: tannergooding@f862540

, is the JIT containment logic smart enough to know it's already in a xmm register and skip that part

This is a case where, if the JIT isn't smart enough to do this, it probably should be. Most ABIs have some concept of homogeneous vector aggregate values (structures that can be packed in one or more vector registers), and they should appropriately handle the transition between memory and register where possible.

As far as the porting process, would the goal be to replace all [intrinsic] methods first and then SIMD-ize the things that aren't using intrinsics today?

Yes, I think the initial goal should be to replace the [Intrinsic] methods (since they are actually intrinsic today) and then expand that to also include the other methods.

I would expect that you will want some kind of mini "helper" library that only operates on Vector128<float>. You would then implement the various Vector4 (or Vector2/Vector3) operations by:

  • Loading the Vector2/3/4 arguments into Vector128
  • Calling into the appropriate helper functions to perform the operation
  • Converting the Vector128 back into a Vector2/3/4 and returning

I think this is probably desirable as:

  • Algorithms that operate on Vector128 are generally reusable (including outside System.Numerics)
  • You don't want to have multiple transitions from Vector2/3 to Vector128 and back (as they are not always a cheap conversion, unless already in register)
  • You don't want to duplicate the same logic in multiple places (DotProduct, Length, and LengthSquared are all basically the same; see the sketch below)
  • etc
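
For example, following those steps (and assuming the hypothetical VectorMath.DotProduct helper sketched above plus the preview intrinsics API; this is not actual corefx code), Length and LengthSquared could both funnel through the one dot-product routine. They are written here as static methods taking a Vector4 just so the sketch stands alone:

using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static unsafe float LengthSquared(Vector4 value)
{
    if (Sse.IsSupported)
    {
        // Load the Vector4, reuse the shared dot-product helper, then return the low lane.
        Vector128<float> v = Sse.LoadVector128((float*)&value);
        return Sse.ConvertToSingle(VectorMath.DotProduct(v, v));
    }

    return value.X * value.X + value.Y * value.Y + value.Z * value.Z + value.W * value.W;
}

public static unsafe float Length(Vector4 value)
{
    if (Sse.IsSupported)
    {
        // Same dot product, followed by a scalar square root, without leaving the xmm registers.
        Vector128<float> v = Sse.LoadVector128((float*)&value);
        return Sse.ConvertToSingle(Sse.SqrtScalar(VectorMath.DotProduct(v, v)));
    }

    return MathF.Sqrt(LengthSquared(value));
}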
@CarolEidt

Member

CarolEidt commented Aug 2, 2018

Users will also have their own vector/numeric types, so we should focus on making sure that getting a Vector128 from a user defined struct is fast/efficient in general.

Agree! Related to dotnet/coreclr#19116

#19116 deals with vectors wrapped in a class, which is different from wrapping them in a struct. The former is what involves escape analysis or stack allocation, while the latter involves appropriate struct promotion and copy elimination. The JIT has room for improvement in all of those areas.

@fiigii

Contributor

fiigii commented Aug 2, 2018

deals with vectors wrapped in a class, which is different from wrapping them in a struct.

Yes, in that issue, I also said

the current struct promotion also does not work with VectorPacket, so changing VectorPacket from class to struct will generate many memory copies and get worse performance.

@CarolEidt Do you think this is a good time to improve the struct promotion for struct-wrapped Vector128/256?

@saucecontrol

Member

saucecontrol commented Aug 2, 2018

Thanks, @tannergooding

I found my way over to your Vector4 prototype, and I'm relieved to find it matches up pretty closely with my mental model. I'll look into this a bit more in the next few days and see if I can commit to getting it done. Depends how much is already in place as far as tests and benchmarks.

@benaadams

Collaborator

benaadams commented Aug 2, 2018

Yes, I think the initial goal should be to replace the [Intrinsic] methods (since they are actually intrinsic today) and then expand that to also include the other methods.

Wouldn't it be better to replace the non-intrinsic methods first as that's actually a gain? Or is it to test that the current intrinsic methods match up when using the new intrinsics?

@tannergooding

Member

tannergooding commented Aug 2, 2018

Or is it to test that the current intrinsic methods match up when using the new intrinsics?

@benaadams, right. I think we want to first make sure there is no loss in existing performance and then work on improving general performance on the rest of the functions after that.

@CarolEidt

Member

CarolEidt commented Aug 2, 2018

Do you think this is a good time to improve the struct promotion for struct-wrapped Vector128/256?

I think it would be great to do that; I don't think it rises to the top of my priority list, but I would be supportive of others working on it.

@fiigii

Contributor

fiigii commented Aug 2, 2018

@CarolEidt Thanks! Let me give it a try next week (after I finish the Avx2.Gather* intrinsics).

@tannergooding

Member

tannergooding commented Aug 2, 2018

@saucecontrol, thanks! Let me know if you do want to pick this up "more officially" and we can add you as a contributor and assign the issue out 😄.

Note: Being assigned the issue wouldn't mean that you have any obligation to actually complete the work or any kind of deadline for it. It is mostly to communicate that someone is "actively" working on it. If you decide to drop it at a later time, that is completely fine, and we will unassign the issue so that someone else can pick it up.

I would be on point for assisting if you have any questions/concerns or if you just need help (the same goes for anyone else if they decide they want to pick this up instead).

@saucecontrol

Member

saucecontrol commented Aug 2, 2018

@tannergooding, sounds good to me! If you don't mind answering the occasional stupid question on gitter, I should be able to get this done. Let's go ahead and make it official.

@tannergooding

Member

tannergooding commented Aug 2, 2018

Feel free to ask as many questions as needed 😄

You should have gotten an invitation and the issue has been assigned to you.

@saucecontrol

Member

saucecontrol commented Aug 24, 2018

I started coding on this tonight, and I ran into problems pretty quickly. Take the following test for example

public Vector2 AddemUp()
{
    var sum = Vector2.Zero;
    var vecs = ArrayOfVectors;

    for (int i = 0; i < vecs.Length; i++)
        sum += vecs[i];

    return sum;
}

With the SIMD intrinsics enabled, that for loop compiles to the following:

G_M32824_IG03:
       4C63C2               movsxd   r8, edx
       C4A17B1044C010       vmovsd   xmm0, qword ptr [rax+8*r8+16]
       C4E17B104C2428       vmovsd   xmm1, qword ptr [rsp+28H]
       C4E17058C8           vaddps   xmm1, xmm0
       C4E17B114C2428       vmovsd   qword ptr [rsp+28H], xmm1
       FFC2                 inc      edx
       3BCA                 cmp      ecx, edx
       7FDD                 jg       SHORT G_M32824_IG03

Which is not too bad. Would be better if it didn't spill vsum each time around, but it could be worse 😄

Then I started my prototype simply, with only op_Addition, which I defined thusly:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector2 operator +(Vector2 left, Vector2 right)
{
#if HAS_INTRINSICS
    if (Sse2.IsSupported)
    {
        Vector2 vres = default;

        var vleft = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&left.X));
        var vright = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&right.X));
        Sse2.StoreScalar((double*)&vres.X, Sse.StaticCast<float, double>(Sse.Add(vleft, vright)));

        return vres;
    }
#endif

    return new Vector2(left.X + right.X, left.Y + right.Y);
}

And that produces the following for the same loop:

G_M32825_IG03:
       C4E17A10442438       vmovss   xmm0, dword ptr [rsp+38H]
       C4E17A11442428       vmovss   dword ptr [rsp+28H], xmm0
       C4E17A1044243C       vmovss   xmm0, dword ptr [rsp+3CH]
       C4E17A1144242C       vmovss   dword ptr [rsp+2CH], xmm0
       4C63C2               movsxd   r8, edx
       4E8D44C010           lea      r8, bword ptr [rax+8*r8+16]
       C4C17A1000           vmovss   xmm0, dword ptr [r8]
       C4E17A11442420       vmovss   dword ptr [rsp+20H], xmm0
       C4C17A104004         vmovss   xmm0, dword ptr [r8+4]
       C4E17A11442424       vmovss   dword ptr [rsp+24H], xmm0
       C4E17857C0           vxorps   xmm0, xmm0
       C4E17A11442430       vmovss   dword ptr [rsp+30H], xmm0
       C4E17A11442434       vmovss   dword ptr [rsp+34H], xmm0
       C4E17A11442430       vmovss   dword ptr [rsp+30H], xmm0
       C4E17A11442434       vmovss   dword ptr [rsp+34H], xmm0
       4C8D442428           lea      r8, bword ptr [rsp+28H]
       C4C17B1000           vmovsd   xmm0, xmmword ptr [r8]
       4C8D442420           lea      r8, bword ptr [rsp+20H]
       C4C17B1008           vmovsd   xmm1, xmmword ptr [r8]
       4C8D442430           lea      r8, bword ptr [rsp+30H]
       C4E17858C1           vaddps   xmm0, xmm0, xmm1
       C4C17B1100           vmovsd   xmmword ptr [r8], xmm0
       4C8B442430           mov      r8, qword ptr [rsp+30H]
       4C89442438           mov      qword ptr [rsp+38H], r8
       FFC2                 inc      edx
       3BCA                 cmp      ecx, edx
       0F8F6BFFFFFF         jg       G_M32825_IG03

My movsds and addps came through nicely, but the stack thrashing is beyond my worst expectations.

I'm wondering whether I've done something wrong. In order to disable the Vector2 Intrinsics, I commented out the following in my copy of the JIT

https://github.com/dotnet/coreclr/blob/59ea9c4e1937ea44faddb31dc3f17f8f1001f1d3/src/jit/simd.cpp#L318-L326

Would that have side effects beyond disabling the specific [Intrinsic] methods? Like would that cause the struct to not be recognized as fitting in a SIMD8 when it otherwise would be enregistered? And if so, is there a better way to disable the Intrinsics while I'm testing?

Also, I'm wondering what will have to be done to get the JIT to recognize the equivalent of __vectorcall for those operator arguments. In my previous HWIntrinsics code, I've resorted to passing my vectors by ref, but we have to keep the existing contracts on the Vector types, so that's not an option here.

@CarolEidt

Member

CarolEidt commented Aug 24, 2018

@saucecontrol - if you comment out the code that recognizes the Vector2 type, the JIT will treat it as a regular struct. A better way would be to remove the intrinsics that you are implementing from hwintrinsiclistxarch.h.

You will run into additional issues with Vector2. Because 8-byte structs are passed in a general purpose register on x64/Windows, the JIT doesn't enregister them. This is excessively conservative (it should just move them as/when needed), and is a "TODO" item to fix.

@tannergooding

Member

tannergooding commented Aug 24, 2018

@CarolEidt, in this case I think it would need to be cut from simdintrinsiclist.h; hwintrinsiclistxarch.h is for the HWIntrinsics, not for System.Numerics.Vector2.

@saucecontrol, you will also get slightly better codegen if you don't initialize vres (which cuts out 22 bytes).

Vector2 vres;

var vleft = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&left.X));
var vright = Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&right.X));
Sse2.StoreScalar((double*)&vres, Sse.StaticCast<float, double>(Sse.Add(vleft, vright)));

return vres;

which produces:

G_M6507_IG03:
       vmovsd   xmm0, qword ptr [rsp+20H]
       vmovsd   qword ptr [rsp+18H], xmm0
       movsxd   r8, eax
       lea      r8, bword ptr [rdx+8*r8+16]
       vmovss   xmm0, dword ptr [r8]
       vmovss   dword ptr [rsp+10H], xmm0
       vmovss   xmm0, dword ptr [r8+4]
       vmovss   dword ptr [rsp+14H], xmm0
       vxorps   xmm0, xmm0
       vmovss   dword ptr [rsp+08H], xmm0
       vmovss   dword ptr [rsp+0CH], xmm0
       lea      r8, bword ptr [rsp+18H]
       vmovsd   xmm0, xmmword ptr [r8]
       lea      r8, bword ptr [rsp+10H]
       vmovsd   xmm1, xmmword ptr [r8]
       vaddps   xmm0, xmm0, xmm1
       lea      r8, bword ptr [rsp+08H]
       vmovsd   xmmword ptr [r8], xmm0
       vmovsd   xmm0, qword ptr [rsp+08H]
       vmovsd   qword ptr [rsp+20H], xmm0
       inc      eax
       cmp      ecx, eax
       jg       SHORT G_M6507_IG03
@tannergooding

Member

tannergooding commented Aug 24, 2018

Doing the following (which mostly removes the locals and accesses left/right directly, rather than through .X):

public Vector2 AddEmUp_HW(Vector2[] vectors)
{
    var sum = Vector2.Zero;
    
    for (var i = 0; i < vectors.Length; i++)
    {
        sum = AddHW(sum, vectors[i]);
    }

    return sum;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
unsafe public static Vector2 AddHW(Vector2 left, Vector2 right)
{
    Vector2 vres;

    Sse2.StoreScalar(
        (double*)&vres,
        Sse.StaticCast<float, double>(Sse.Add(
            Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&left)),
            Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&right)))
        )
    );

    return vres;
}

Looks to give the "best" assembly listing (including the header comments, since there are 4 less temps the JIT is tracking):

; Assembly listing for method Vector2:AddEmUp_HW(ref):struct:this
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )   byref  ->  zero-ref    this
;  V01 arg1         [V01,T03] (  4,  7   )     ref  ->  rdx         class-hnd
;  V02 loc0         [V02,T02] (  4, 10   )   simd8  ->  [rsp+0x20]   do-not-enreg[SF] must-init
;  V03 loc1         [V03,T01] (  5, 17   )     int  ->  rax        
;# V04 OutArgs      [V04    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]  
;  V05 tmp1         [V05    ] (  4, 16   )   simd8  ->  [rsp+0x18]   do-not-enreg[XSF] must-init addr-exposed ld-addr-op
;  V06 tmp2         [V06    ] (  2, 16   )   simd8  ->  [rsp+0x10]   do-not-enreg[XSB] addr-exposed ld-addr-op
;  V07 tmp3         [V07    ] (  3, 24   )   simd8  ->  [rsp+0x08]   do-not-enreg[XS] addr-exposed ld-addr-op
;  V08 tmp4         [V08    ] (  3, 12   )   float  ->  [rsp+0x18]   do-not-enreg[X] addr-exposed V05.X(offs=0x00) P-DEP
;  V09 tmp5         [V09    ] (  3, 12   )   float  ->  [rsp+0x1C]   do-not-enreg[X] addr-exposed V05.Y(offs=0x04) P-DEP
;  V10 tmp6         [V10    ] (  2, 16   )   float  ->  [rsp+0x10]   do-not-enreg[X] addr-exposed V06.X(offs=0x00) P-DEP
;  V11 tmp7         [V11    ] (  2, 16   )   float  ->  [rsp+0x14]   do-not-enreg[X] addr-exposed V06.Y(offs=0x04) P-DEP
;  V12 tmp8         [V12    ] (  2, 12   )   float  ->  [rsp+0x08]   do-not-enreg[X] addr-exposed V07.X(offs=0x00) P-DEP
;  V13 tmp9         [V13    ] (  2, 12   )   float  ->  [rsp+0x0C]   do-not-enreg[X] addr-exposed V07.Y(offs=0x04) P-DEP
;  V14 tmp10        [V14,T00] (  3, 24   )   byref  ->   r8        
;  V15 cse0         [V15,T04] (  3,  6   )     int  ->  rcx        
;
; Lcl frame size = 40
G_M6507_IG01:
       sub      rsp, 40
       vzeroupper 
       xor      rax, rax
       mov      qword ptr [rsp+20H], rax
       mov      qword ptr [rsp+18H], rax
G_M6507_IG02:
       vxorps   xmm0, xmm0
       vmovsd   qword ptr [rsp+20H], xmm0
       xor      eax, eax
       mov      ecx, dword ptr [rdx+8]
       test     ecx, ecx
       jle      SHORT G_M6507_IG04
G_M6507_IG03:
       vmovsd   xmm0, qword ptr [rsp+20H]
       vmovsd   qword ptr [rsp+10H], xmm0
       movsxd   r8, eax
       lea      r8, bword ptr [rdx+8*r8+16]
       vmovss   xmm0, dword ptr [r8]
       vmovss   dword ptr [rsp+08H], xmm0
       vmovss   xmm0, dword ptr [r8+4]
       vmovss   dword ptr [rsp+0CH], xmm0
       vxorps   xmm0, xmm0
       vmovss   dword ptr [rsp+18H], xmm0
       vmovss   dword ptr [rsp+1CH], xmm0
       lea      r8, bword ptr [rsp+10H]
       vmovsd   xmm0, xmmword ptr [r8]
       lea      r8, bword ptr [rsp+08H]
       vmovsd   xmm1, xmmword ptr [r8]
       vaddps   xmm0, xmm0, xmm1
       lea      r8, bword ptr [rsp+18H]
       vmovsd   xmmword ptr [r8], xmm0
       vmovsd   xmm0, qword ptr [rsp+18H]
       vmovsd   qword ptr [rsp+20H], xmm0
       inc      eax
       cmp      ecx, eax
       jg       SHORT G_M6507_IG03
G_M6507_IG04:
       vmovsd   xmm0, qword ptr [rsp+20H]
       vmovd    rax, xmm0
G_M6507_IG05:
       add      rsp, 40
       ret      
; Total bytes of code 178, prolog size 19 for method Vector2:AddEmUp_HW(ref):struct:this
; ============================================================
Unwind Info:
  >> Start offset   : 0x000000 (not in unwind data)
  >>   End offset   : 0xd1ffab1e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x04
  CountOfUnwindCodes: 1
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x04 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 4 * 8 + 8 = 40 = 0x28
@CarolEidt

Member

CarolEidt commented Aug 24, 2018

in this case I think it would need to be cut from simdintrinsiclist.h

Yes, of course - thanks!

@saucecontrol

Member

saucecontrol commented Aug 24, 2018

@tannergooding thanks for the tips. I probably should have started out with code closer to what I actually expect to use. Ideally, I'd like to structure it something like this:

internal static class Vector2Extensions
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static unsafe Vector128<float> AsVector128(this Vector2 v2) =>
        Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&v2));

}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe Vector2 FromVector128(Vector128<float> v128)
{
    Vector2 v2;
    Sse2.StoreScalar((double*)&v2, Sse.StaticCast<float, double>(v128));
    return v2;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector2 operator +(Vector2 left, Vector2 right)
{
#if HAS_INTRINSICS
    if (Sse2.IsSupported)
        return FromVector128(Sse.Add(left.AsVector128(), right.AsVector128()));
#endif

    return new Vector2(left.X + right.X, left.Y + right.Y);
}

That gives me this, which sort of splits the difference but allocates more xmm regs.

G_M32828_IG03:
       vmovss   xmm0, dword ptr [rsp+38H]
       vmovss   xmm1, dword ptr [rsp+3CH]
       movsxd   r8, edx
       lea      r8, bword ptr [rax+8*r8+16]
       vmovss   xmm2, dword ptr [r8]
       vmovss   xmm3, dword ptr [r8+4]
       vmovss   dword ptr [rsp+30H], xmm0
       vmovss   dword ptr [rsp+34H], xmm1
       lea      r8, bword ptr [rsp+30H]
       vmovsd   xmm0, xmmword ptr [r8]
       vmovss   dword ptr [rsp+28H], xmm2
       vmovss   dword ptr [rsp+2CH], xmm3
       lea      r8, bword ptr [rsp+28H]
       vmovsd   xmm1, xmmword ptr [r8]
       vaddps   xmm0, xmm0, xmm1
       vxorps   xmm1, xmm1
       vmovss   dword ptr [rsp+20H], xmm1
       vmovss   dword ptr [rsp+24H], xmm1
       lea      r8, bword ptr [rsp+20H]
       vmovsd   xmmword ptr [r8], xmm0
       mov      r8, qword ptr [rsp+20H]
       mov      qword ptr [rsp+38H], r8
       inc      edx
       cmp      ecx, edx
       jg       G_M32828_IG03

@CarolEidt I restored the Vector2 recognition in the JIT and just disabled the op_Addition mapping, but that didn't improve anything. In fact, it made things slightly worse. Looks like it's even more confused about whether it's a SIMD type or a normal struct.

[image]

Ideally, I'd like to approach this one class at a time, especially since the method Intrinsic mappings affect Vector<T> in addition to Vector2/3/4. I figured Vector4 would be the cleanest and Vector3 would be the ugliest, so I decided to split the difference and start with Vector2. I do see there's an extra challenge there in that SIMD8 fits in a GP reg on x64, so maybe that's not the best place to start after all.

I'd like to make sure all the codegen issues that come up here are being tracked, but I'm not familiar enough with compiler terminology to find the existing issues or differentiate them sometimes. Obviously, I can see what's wrong with the code that comes out of the JIT, but I'd like to be able to create useful/accurate tracking issues for those things. I'll spend some time today looking through the open issues to see what I can match up.

@CarolEidt

Member

CarolEidt commented Aug 24, 2018

In fact, it made things slightly worse. Looks like it's even more confused about whether it's a SIMD type or a normal struct.

I'm not entirely sure, but this may be due to some pessimization of the GPR-passed types in the importer, before the method is inlined. However, I thought that all happened in fgMorphArgs (i.e. after the inliner).

Contributor

fiigii commented Aug 27, 2018

I still would like to say that the current approach is very inefficient (#31779 (comment)).

Each S.N.Vector intrinsic in #31425 (comment) loads the source data, does some simple SIMD computation, and stores the result back to memory. That would be fine for micro-benchmarks and for Matrix4x4 (which was not vectorized before), but I believe it would cause an obvious regression in scenarios that heavily use Vector3/Vector4/Vector<T>.

For example, the code below is generated from the PacketTracer benchmark, which uses hardware intrinsics directly. You can imagine that memory accesses would be inserted around (almost) every SIMD instruction if we "rewrote" it in terms of Vector<T> (such a rewrite is not actually possible; this is just an illustration). Furthermore, because RyuJIT currently has no memory dependency analysis (such as memory SSA), the JIT cannot eliminate these loads/stores.

vaddps ymm2, ymm2, ymmword ptr [r8+0x8]
vmulps ymm2, ymm2, ymm0
vaddps ymm2, ymm2, ymmword ptr [rax]
vmulps ymm2, ymm2, ymm3
vaddps ymm0, ymm2, ymm0
vaddps ymm0, ymm0, ymmword ptr [rcx]
vpaddd ymm1, ymm1, ymmword ptr [rdx]
vpslld ymm1, ymm1, 0x17
vmulps ymm0, ymm0, ymm1
vmulps ymm1, ymm0, ymmword ptr   [rsp+0x600]
vmulps ymm2, ymm0, ymmword ptr   [rsp+0x5e0]
vmulps ymm0, ymm0, ymmword ptr   [rsp+0x5c0]
vmovupd ymmword ptr [rsp+0x260], ymm1 ;;; write the results back only after performing as many SIMD computations as possible
vmovupd ymmword ptr [rsp+0x240], ymm2
vmovupd ymmword ptr [rsp+0x220], ymm0
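
To make the concern concrete, here is a minimal C# sketch contrasting a per-operator conversion round-trip with converting once at the expression boundaries. It reuses the AsVector128/FromVector128 helpers sketched earlier in this thread, so the helper names and the preview-era API shape are illustrative rather than an official surface:

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class RoundTripSketch
{
    // Boundary conversions, as in the earlier comment (hypothetical helpers, not an official API).
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static unsafe Vector128<float> AsVector128(Vector2 v2) =>
        Sse.StaticCast<double, float>(Sse2.LoadScalarVector128((double*)&v2));

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static unsafe Vector2 FromVector128(Vector128<float> v128)
    {
        Vector2 v2;
        Sse2.StoreScalar((double*)&v2, Sse.StaticCast<float, double>(v128));
        return v2;
    }

    // If each Vector2 operator converts in and out as above, every '+' and '*' below
    // costs its own load/compute/store round-trip: the intermediates live in memory.
    public static Vector2 PerOperator(Vector2 a, Vector2 b, Vector2 c)
        => (a + b) * c + a;

    // Converting once at the boundaries keeps the intermediates in xmm registers,
    // which is what the hand-written PacketTracer code above achieves.
    public static Vector2 SingleRoundTrip(Vector2 a, Vector2 b, Vector2 c)
    {
        Vector128<float> va = AsVector128(a);
        Vector128<float> t = Sse.Add(va, AsVector128(b));
        t = Sse.Multiply(t, AsVector128(c));
        t = Sse.Add(t, va);
        return FromVector128(t); // a single store at the end
    }
}

Under the per-operator pattern, eliminating the extra loads and stores would require the memory dependency analysis described above, which RyuJIT does not currently have.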

Member

tannergooding commented Aug 27, 2018

@fiigii, these are likely things that the JIT will have to fix. Without those fixes, only code in System.Private.CoreLib will be able to remain efficient, and any user defined structs will be hit with these inefficiencies. cc @CarolEidt

Contributor

fiigii commented Aug 27, 2018

these are likely things that the JIT will have to fix.

AFAIK, that would be really difficult.
Why not just add an "internal" intrinsic that converts between Vector128<T>/Vector256<T> and the S.N.Vector types (like StaticCast<T, U>)?
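
For concreteness, such a conversion helper could be shaped something like the sketch below. The names and the Unsafe.As-based fallback are purely illustrative (no such API exists in the surface being discussed); the point is that the JIT could recognize these methods and treat them as register-to-register no-ops, since Vector4 and Vector128<float> have identical size and layout:

using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

// Hypothetical internal conversion helpers; the names and shape are illustrative only.
internal static class VectorInterop
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static Vector128<float> AsVector128(this Vector4 value) =>
        Unsafe.As<Vector4, Vector128<float>>(ref value);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static Vector4 AsVector4(this Vector128<float> value) =>
        Unsafe.As<Vector128<float>, Vector4>(ref value);
}

Vector2 and Vector3 are harder, since they are 8 and 12 bytes respectively, so any conversion has to define what happens to the upper lanes; that ties back to the SIMD8-in-a-GP-register issue mentioned earlier in the thread.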

Member

tannergooding commented Aug 27, 2018

Because not everyone uses S.N.Vectors: some people use their own vector types (Unity, Math.NET, etc.).

Collaborator

benaadams commented Aug 27, 2018

any user defined structs will be hit with these inefficiencies.

There's an issue for it dotnet/coreclr#18542

Member

CarolEidt commented Aug 28, 2018

AFAIK, that would be really difficult.
Why not just add an "internal" intrinsic that converts between Vector128<T>/Vector256<T> and the S.N.Vector types (like StaticCast<T, U>)?

First, without a full analysis of where the inefficiency is coming from in this case, it is hard to say how difficult this would be.

On the other hand, a conversion intrinsic would not only be a non-trivial amount of work, but its messiness would be an unfortunate way to gloss over a fundamental inefficiency.

I think we need to analyze what's the real source of these inefficiencies (there are a number of candidates) and assess the cost of fixing them.
