
Recognize the `Vector128/256.AsVector*` methods as intrinsic #1280

Open · wants to merge 4 commits into master from tannergooding:as-vector
Conversation

tannergooding (Member) commented Jan 3, 2020

This updates the Vector128.AsVector* and Vector256.AsVector* methods to be recognized as intrinsic so that the conversions can be low- or zero-cost where applicable.
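For reference, a rough sketch of the conversion APIs involved, at the managed level (assuming the System.Runtime.Intrinsics surface available at the time; only the codegen for these methods is affected):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class AsVectorOverview
{
    // Vector<T> <-> Vector256<T>
    public static Vector<int> Narrow256(Vector256<int> value) => Vector256.AsVector(value);
    public static Vector256<int> Widen256(Vector<int> value)  => Vector256.AsVector256(value);

    // Vector<T> <-> Vector128<T>
    public static Vector<int> Narrow128(Vector128<int> value) => Vector128.AsVector(value);
    public static Vector128<int> Widen128(Vector<int> value)  => Vector128.AsVector128(value);

    // Vector4 <-> Vector128<float>: both are 16 bytes, so this can be a pure reinterpretation.
    public static Vector4 ToVector4(Vector128<float> value)   => Vector128.AsVector4(value);
    public static Vector128<float> FromVector4(Vector4 value) => Vector128.AsVector128(value);
}
```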


case NI_Vector128_AsVector2:
case NI_Vector128_AsVector3:
case NI_Vector128_AsVector4:

tannergooding (Author, Member) commented Jan 3, 2020

AsVector4 is the simplest: we have a TYP_SIMD16 and are going to a TYP_SIMD16.

AsVector2 and AsVector3 should also be simple. We have a TYP_SIMD16 and we want a TYP_SIMD8 or TYP_SIMD12. If they are already in a register, no change is needed. If they are in memory, we have 16 bytes that are correctly aligned, so it should likewise be safe to just access them.
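At the managed level the two narrower conversions are just reinterpretations of the lower bytes (a rough sketch; whether the unused upper lanes must be zeroed is discussed below):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class NarrowingSketch
{
    // TYP_SIMD16 -> TYP_SIMD8: only the lower 8 bytes are observable.
    public static Vector2 AsVector2(Vector128<float> value) => Vector128.AsVector2(value);

    // TYP_SIMD16 -> TYP_SIMD12: only the lower 12 bytes are observable.
    public static Vector3 AsVector3(Vector128<float> value) => Vector128.AsVector3(value);
}
```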

CarolEidt (Member) commented Jan 7, 2020

> If they are already in register, there is no change needed.

Right, but we should ensure that they get the right class layout.

mikedn (Collaborator) commented Jan 7, 2020

> If they are already in register, there is no change needed

I'm not so sure. Some SIMD8/12 operations intentionally zero out the upper unused elements (SIMDIntrinsicDiv and maybe others). Either that zeroing is unnecessary or AsVector2/3 should also zero out.

tannergooding (Author, Member) commented Jan 7, 2020

I believe that's just to avoid the penalty around x / 0, though, rather than a functional requirement.

We may want to have both options: a "simple, but correct" API, one which ensures that the upper bits are correctly zeroed, and a "fast, but unsafe" API, one which leaves clearing/setting the upper bits to the user so that the conversion cost is as low as possible (generally free).

tannergooding (Author, Member) commented Jan 16, 2020

@CarolEidt, do you have an opinion here?

Should I, when converting Vector128<float> to Vector2/3, explicitly zero the untracked z/w fields, or should that responsibility be left to the consumer if they feel it is important?

As best I can tell, it doesn't matter from a functional perspective, and at worst it could cause a minor perf hit if those fields end up containing a subnormal value or NaN.

tannergooding (Author, Member) commented Jan 16, 2020

No, dpps allows you to specify which elements are consumed and which are ignored as part of the computation.

tannergooding (Author, Member) commented Jan 16, 2020

Looks like we might need to update the codegen for Vector2.Dot, however. It currently generates the same code as Vector4.Dot, using 0xF1 as the control word, when it should be 0x3F or 0x31. Vector3.Dot already correctly uses 0x71.
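For reference, the dpps control byte encodes two 4-bit masks: the high nibble selects which lanes participate in the multiply, and the low nibble selects which result lanes receive the sum. A rough sketch via Sse41.DotProduct:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class DppControlBytes
{
    // 0x31: multiply lanes 0 and 1 only (Vector2.Dot), write the sum to lane 0.
    public static float Dot2(Vector128<float> a, Vector128<float> b)
        => Sse41.DotProduct(a, b, 0x31).ToScalar();

    // 0x71: multiply lanes 0..2 (Vector3.Dot), write the sum to lane 0.
    public static float Dot3(Vector128<float> a, Vector128<float> b)
        => Sse41.DotProduct(a, b, 0x71).ToScalar();

    // 0xF1: multiply all four lanes (Vector4.Dot); using this for Vector2.Dot
    // would incorrectly include the z and w lanes in the sum.
    public static float Dot4(Vector128<float> a, Vector128<float> b)
        => Sse41.DotProduct(a, b, 0xF1).ToScalar();
}
```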

saucecontrol (Member) commented Jan 16, 2020

Yeah, just saw that. From what I can tell, the SSE2 fallback for Vector2 is also using the Vector4 code.

tannergooding (Author, Member) commented Jan 16, 2020

Sounds like there's some room for improvement there, then. I'll take a look at a few other cases and log bugs to start fixing them.

tannergooding (Author, Member) commented Jan 17, 2020

I reverted the change treating AsVector2/AsVector3 as intrinsic for now, since it requires some fixes to the SIMD code and it looks like it requires a few fixes elsewhere in the JIT as well.

break;
}

case NI_Vector128_AsVector128:

tannergooding (Author, Member) commented Jan 3, 2020

AsVector128 is a bit more complex.

For Vector4 to Vector128 it should be simple, as it is TYP_SIMD16 to TYP_SIMD16.

The other two (which are currently unhandled) I'd like some input on how to handle.
If they are in a register, it can be a nop. Likewise, if they were spilled to the stack, they can be a nop.
However, if it is a field or an array element, we have to read exactly 8 or 12 bytes (I probably need to propose Unsafe versions so we don't have to zero-init the upper element as well).

Should we carry NI_Vector128_AsVector128 through to codegen and decide what to do there (nop vs actual read, depending on location), or is there something clever we can do earlier in the compilation?

tannergooding (Author, Member) commented Jan 3, 2020

CC @CarolEidt for input here.

CarolEidt (Member) commented Jan 7, 2020

It would be nice to be able to handle this in a "generic" way - that is, using a CAST from one type to another that does the appropriate widening. However, I don't think the current handling of casts is "generic" in a good way. So, I'd probably vote for carrying the conversion through to Lowering and codegen.

mikedn (Collaborator) commented Jan 7, 2020

> that is, using a CAST from one type to another that does the appropriate widening.

Such a cast shouldn't be needed, much like we don't need a cast from short to int after a short indir. A SIMD12 indir really produces a SIMD16 value.

tannergooding (Author, Member) commented Jan 16, 2020

I'm going to leave this to a future PR for now. Vector2/3 to Vector128<float> is going to be left as a managed implementation which explicitly zeroes the unused upper elements.

In particular I want to get some more testing and samples done for this and don't want to block the other changes.
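A rough sketch of what such a managed implementation can look like (the actual library code may differ):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class WideningSketch
{
    // Explicitly zero the unused upper lane(s), so the resulting Vector128<float>
    // has well-defined contents regardless of where the source value lived.
    public static Vector128<float> FromVector3(Vector3 value)
        => Vector128.Create(value.X, value.Y, value.Z, 0.0f);

    public static Vector128<float> FromVector2(Vector2 value)
        => Vector128.Create(value.X, value.Y, 0.0f, 0.0f);
}
```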

break;
}

case NI_Vector256_AsVector:

tannergooding (Author, Member) commented Jan 3, 2020

AsVector is also fairly simple.

If we don't support AVX, we just go to the software fallback.
If we do support AVX but not AVX2 (meaning Vector<T> is 16 bytes), then we just need GetLower (which is generally a nop, as we are either in a register or in a memory location that is correctly aligned and at least 16 bytes).
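Expressed at the API level, a rough sketch of the two hardware-dependent paths described above (this mirrors the described logic, not the actual JIT expansion):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class NarrowToVectorT
{
    public static Vector<float> Narrow(Vector256<float> value)
    {
        if (Vector<float>.Count == Vector256<float>.Count)
        {
            // Vector<T> is 32 bytes (AVX2): pure reinterpretation.
            return Vector256.AsVector(value);
        }

        // Vector<T> is 16 bytes (AVX only): take the lower half, which is
        // generally free when the value is already in a register.
        return Vector128.AsVector(value.GetLower());
    }
}
```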

}

case NI_Vector256_AsVector:
case NI_Vector256_AsVector256:

tannergooding (Author, Member) commented Jan 3, 2020

AsVector256 (which is Vector<T> to Vector256<T>) is similar.

If we don't support AVX, we just go to the software fallback.
If we support AVX2 (meaning Vector<T> is 32 bytes), then it is a nop.
If we only support AVX, we just set the upper 128 bits to zero via ToVector256 (likewise, we probably want an Unsafe API so users can get non-deterministic bits, as we've done for other APIs).
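Again at the API level, a rough sketch of the widening path described above:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class WidenToVector256
{
    public static Vector256<float> Widen(Vector<float> value)
    {
        if (Vector<float>.Count == Vector256<float>.Count)
        {
            // Vector<T> is 32 bytes (AVX2): this is a nop/reinterpretation.
            return Vector256.AsVector256(value);
        }

        // Vector<T> is 16 bytes (AVX only): zero-extend into the upper 128 bits.
        return Vector128.AsVector128(value).ToVector256();
    }
}
```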

mikedn (Collaborator) commented Jan 7, 2020

> users can get non-deterministic bits

What does that mean? Who needs non-deterministic bits? This isn't a random number generator.

tannergooding (Author, Member) commented Jan 7, 2020

The intent is to provide an API that offers fast conversion and requires the user to manage the upper bits themselves.

This is similar to how we have both ToVector256 and ToVector256Unsafe: the former explicitly zeroes the upper 128 bits, while the latter is an effective nop that just allows access to the upper 128 bits of the register.
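For illustration, a rough sketch of the two flavors (WithUpper is just one way a caller might define the upper bits afterwards):

```csharp
using System.Runtime.Intrinsics;

static class UpperBitsExample
{
    // ToVector256: the upper 128 bits of the result are guaranteed to be zero.
    public static Vector256<float> ZeroExtend(Vector128<float> lower)
        => lower.ToVector256();

    // ToVector256Unsafe: the upper 128 bits are left undefined; the caller must
    // set them (here via WithUpper) before relying on their contents.
    public static Vector256<float> Combine(Vector128<float> lower, Vector128<float> upper)
        => lower.ToVector256Unsafe().WithUpper(upper);
}
```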

tannergooding force-pushed the tannergooding:as-vector branch 2 times, most recently from e0cda97 to 4dc66de on Jan 6, 2020
tannergooding (Author, Member) commented Jan 7, 2020 (comment minimized)

tannergooding force-pushed the tannergooding:as-vector branch from f4455fb to 861a68f on Jan 10, 2020
tannergooding (Author, Member) commented Jan 10, 2020

Just rebased onto current head. Working on resolving feedback given so far.

tannergooding (Author, Member) commented Jan 17, 2020

The current changes result in the following diff (slimmed down to just the modified tests; there were no changes to other tests or framework/runtime assemblies):

Found 8 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -2456 (-0.14% of base)
    diff is an improvement.

Top file improvements (bytes):
        -672 : Vector128_1_r.dasm (-0.14% of base)
        -660 : Vector256_1_r.dasm (-0.12% of base)
        -624 : Vector128_1_ro.dasm (-0.19% of base)
        -500 : Vector256_1_ro.dasm (-0.13% of base)

4 total files with Code Size differences (4 improved, 0 regressed), 0 unchanged.

Top method regressions (bytes):
          16 (10.53% of base) : Vector128_1_ro.dasm - VectorAs__AsVector4Single:RunBasicScenario():this

Top method improvements (bytes):
         -66 (-17.19% of base) : Vector128_1_r.dasm - VectorAs__AsVector4Single:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorByte:RunBasicScenario():this
         -66 (-16.18% of base) : Vector256_1_r.dasm - VectorAs__AsVectorDouble:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt16:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt32:RunBasicScenario():this
         -66 (-16.38% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt64:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorSByte:RunBasicScenario():this
         -66 (-16.18% of base) : Vector256_1_r.dasm - VectorAs__AsVectorSingle:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt16:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt32:RunBasicScenario():this
         -66 (-16.38% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt64:RunBasicScenario():this
         -64 (-24.62% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorByte:RunBasicScenario():this
         -64 (-25.40% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorDouble:RunBasicScenario():this
         -64 (-24.52% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt16:RunBasicScenario():this
         -64 (-24.90% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt32:RunBasicScenario():this
         -64 (-24.81% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt64:RunBasicScenario():this
         -64 (-24.52% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorSByte:RunBasicScenario():this
         -64 (-25.30% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorSingle:RunBasicScenario():this
         -64 (-24.62% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt16:RunBasicScenario():this
         -64 (-24.90% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt32:RunBasicScenario():this

Top method regressions (percentages):
          16 (10.53% of base) : Vector128_1_ro.dasm - VectorAs__AsVector4Single:RunBasicScenario():this

Top method improvements (percentages):
         -64 (-25.40% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorDouble:RunBasicScenario():this
         -64 (-25.30% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorSingle:RunBasicScenario():this
         -64 (-24.90% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt32:RunBasicScenario():this
         -64 (-24.90% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt32:RunBasicScenario():this
         -64 (-24.81% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt64:RunBasicScenario():this
         -64 (-24.81% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt64:RunBasicScenario():this
         -64 (-24.62% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorByte:RunBasicScenario():this
         -64 (-24.62% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt16:RunBasicScenario():this
         -64 (-24.52% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt16:RunBasicScenario():this
         -64 (-24.52% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorSByte:RunBasicScenario():this
         -66 (-17.19% of base) : Vector128_1_r.dasm - VectorAs__AsVector4Single:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorByte:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt16:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt32:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorSByte:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt16:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt32:RunBasicScenario():this
         -66 (-16.38% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt64:RunBasicScenario():this
         -66 (-16.38% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt64:RunBasicScenario():this
         -66 (-16.18% of base) : Vector256_1_r.dasm - VectorAs__AsVectorDouble:RunBasicScenario():this

42 total methods with Code Size differences (41 improved, 1 regressed), 3634 unchanged.

The diffs are essentially all cases like the following:

-       vmovaps  qword ptr [rsp+A0H], xmm6
-       xor      rax, rax
-       mov      qword ptr [rsp+80H], rax
-       mov      qword ptr [rsp+88H], rax
-       mov      qword ptr [rsp+90H], rax
-       mov      qword ptr [rsp+98H], rax
+       vmovaps  qword ptr [rsp+70H], xmm6
+       vmovaps  qword ptr [rsp+60H], xmm7
+       vextractf128 ymm7, ymm6, 1
        call     VectorAs__AsVectorUInt64:ValidateResult(Vector`1,Vector128`1,String):this
-       vmovupd  ymm0, ymmword ptr[rsp+80H]
-       vmovupd  ymmword ptr[rsp+60H], ymm0
-       vmovupd  xmm6, xmmword ptr [rsp+60H]
-       vmovapd  xmmword ptr [rsp+50H], xmm6
-       vmovupd  ymm0, ymmword ptr[rsp+80H]
-       vmovupd  ymmword ptr[rsp+20H], ymm0
+       vinsertf128 ymm6, ymm6, ymm7, 1
+       vmovaps  ymm0, ymm6
+       vmovapd  xmmword ptr [rsp+20H], xmm0
+       vmovupd  ymmword ptr[rsp+30H], ymm6
tannergooding force-pushed the tannergooding:as-vector branch from 7cf5f92 to ff37586 on Jan 22, 2020
tannergooding marked this pull request as ready for review on Jan 22, 2020
tannergooding changed the title from "[WIP] Recognize the `Vector128/256.AsVector*` methods as intrinsic" to "Recognize the `Vector128/256.AsVector*` methods as intrinsic" on Jan 22, 2020
tannergooding (Author, Member) commented Jan 22, 2020

This should be ready for review now.

I did not treat the currently problematic cases (Vector2/3 to/from Vector128) as intrinsic for now. The other cases (Vector to/from Vector128/256 and Vector4 to/from Vector128) are handled as intrinsics.

CarolEidt (Member) left a comment

I think it's best to avoid the unreached asserts here, but I see that they're all over the place, even in places where there might be a reasonable default action. I'll submit a separate issue to review this.

{
// TYP_SIMD8 and TYP_SIMD12 currently only expose "safe" versions
// which zero the upper elements and so are implemented in managed.
unreached();

CarolEidt (Member) commented Jan 23, 2020

This will expand to a noway_assert, I believe. I think it's best to avoid these where there's a valid default action to take, especially in a non-optimizing phase, since it will cause the compile to be retried with MinOpts. I think the better option for these cases is to have an assert such as assert(!"Unexpected intrinsic in impBaseIntrinsic"); followed by a break, which I believe will cause it to return nullptr, as for an unsupported or unrecognized intrinsic.
