
Recognize the Vector128/256.AsVector* methods as intrinsic #1280

Merged
merged 4 commits on Jan 24, 2020

Conversation

tannergooding
Member

This updates the Vector128.AsVector* and Vector256.AsVector* methods to be intrinsic so they can be low- or zero-cost where applicable.


case NI_Vector128_AsVector2:
case NI_Vector128_AsVector3:
case NI_Vector128_AsVector4:
Member Author

AsVector4 is the simplest: we have a TYP_SIMD16 and are going to a TYP_SIMD16.

AsVector2 and AsVector3 should also be simple. We have a TYP_SIMD16 and we want a TYP_SIMD8 or TYP_SIMD12. If the value is already in a register, no change is needed. If it is in memory, we have 16 bytes that are correctly aligned, so it should likewise be safe to just access.
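The reinterpretation being described can be sketched in scalar terms (illustrative C++ only, not JIT code; the `Simd16` struct and `read_prefix` function are made-up names): a 16-byte, 16-byte-aligned value can be viewed as a Vector2 or Vector3 simply by reading a prefix of its bytes, with no copy or conversion required.

```cpp
#include <cassert>
#include <cstring>

// A TYP_SIMD16 value in memory occupies 16 bytes and is 16-byte aligned.
struct alignas(16) Simd16 { float lanes[4]; };

// Viewing it as a Vector2 (8 bytes) or Vector3 (12 bytes) is just reading a
// prefix of those bytes; the source is always 16 valid bytes, so the access
// can never fault or read out of bounds.
void read_prefix(const Simd16& v, float* out, int count)
{
    std::memcpy(out, v.lanes, count * sizeof(float));
}
```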

Contributor

If they are already in register, there is no change needed.

Right, but we should ensure that they get the right class layout.

Contributor

If they are already in register, there is no change needed

I'm not so sure. Some SIMD8/12 operations intentionally zero out the upper unused elements (SIMDIntrinsicDiv and maybe others). Either that zeroing is unnecessary or AsVector2/3 should also zero out.

Member Author

I believe that's just to avoid the penalty around x / 0, though, rather than a functional requirement.

We may want to have both options: a "simple, but correct" API, which ensures that the upper bits are correctly zeroed, and a "fast, but unsafe" API, which leaves clearing/setting the upper bits to the user so the conversion is as low-cost as possible (generally free).
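The two options can be sketched like so (illustrative C++, not the actual API surface; `as_vector2_safe` and `as_vector2_unsafe` are hypothetical names standing in for the two proposed behaviors):

```cpp
#include <cassert>
#include <cstring>

struct alignas(16) Simd16 { float lanes[4]; };
struct Vec2 { float x, y; };

// "Simple, but correct": the untracked upper lanes of the 16-byte backing
// storage are explicitly zeroed, so later code can rely on them being clean.
Simd16 as_vector2_safe(const Simd16& v)
{
    return Simd16{{v.lanes[0], v.lanes[1], 0.0f, 0.0f}};
}

// "Fast, but unsafe": just take the low 8 bytes. Nothing touches the source's
// upper lanes; if the result ends up back in 16-byte storage, managing those
// bits is the caller's responsibility.
Vec2 as_vector2_unsafe(const Simd16& v)
{
    Vec2 r;
    std::memcpy(&r, v.lanes, sizeof(r));
    return r;
}
```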

Member Author

@CarolEidt, do you have an opinion here?

Should I, when converting Vector128<float> to Vector2/3, explicitly zero the untracked z/w fields or should that responsibility be left up to the consumer if they feel it is important?

As best I can tell, it doesn't matter from a functional perspective, and at worst it could cause a minor perf hit if those fields end up containing a subnormal value or NaN.

Member Author

No, dpps allows you to specify which elements are consumed and which are ignored as part of the computation.

Member Author

Looks like we might need to update the codegen for Vector2.Dot, however. It currently generates the same code as Vector4, using 0xF1 as the control word, when it should be 0x3F or 0x31. Vector3.Dot already correctly uses 0x71.
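For reference, a scalar model of the dpps immediate (an illustration, not the JIT's codegen): bits 7..4 select which source elements enter the product sum, and bits 3..0 select which destination lanes receive the sum (the rest are zeroed). Under that model, 0xF1 consumes all four elements, which is why it is wrong for Vector2; 0x31 consumes only x and y, and 0x71 only x, y, and z.

```cpp
#include <cassert>

// Scalar model of x86 `dpps xmm, xmm, imm8` semantics.
void dpps_model(const float a[4], const float b[4], unsigned imm8, float out[4])
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
    {
        if (imm8 & (1u << (4 + i)))   // bits 7..4: element participates?
            sum += a[i] * b[i];
    }
    for (int i = 0; i < 4; i++)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f; // bits 3..0: lane receives sum?
}
```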

Member

Yeah, just saw that. The SSE2 fallback for Vector2 is also using the Vector4 code, from what I can tell.

Member Author

Sounds like there's some room for improvement there, then. I'll take a look at a few other cases and log bugs to start fixing them.

Member Author

I reverted the change treating AsVector2/AsVector3 as intrinsic for now, since it requires some fixes to the SIMD code and, it looks like, a few fixes elsewhere in the JIT as well.

break;
}

case NI_Vector128_AsVector128:
Member Author

AsVector128 is a bit more complex.

For Vector4 to Vector128 it should be simple, as it is TYP_SIMD16 to TYP_SIMD16.

The other two (which are currently unhandled) I'd like some input on how to handle.
If the value is in a register, it can be a nop. Likewise, if it was spilled to the stack, it can be a nop.
However, if it is a field or an array element, we have to read exactly 8 or 12 bytes (I probably need to propose Unsafe versions so we don't have to zero-init the upper element as well).

Should we carry the NI_Vector128_AsVector128 through to codegen and decide what to do there (nop vs actual read, depending on location) or is there something clever we can do earlier in the compilation?
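The "field or array element" case can be sketched as follows (illustrative C++ under the assumptions above, not JIT code; `widen_simd12` is a made-up name): only the 12 bytes that actually exist may be read, and the remaining lane must be zero-initialized by the conversion itself to produce a well-defined TYP_SIMD16.

```cpp
#include <cassert>
#include <cstring>

struct alignas(16) Simd16 { float lanes[4]; };

// Widen a 12-byte Vector3 stored in a field or array element. Reading 16
// bytes could fault or pick up adjacent data, so we read exactly 12 bytes
// and zero the fourth lane ourselves.
Simd16 widen_simd12(const float src[3])
{
    Simd16 r{{0.0f, 0.0f, 0.0f, 0.0f}};           // deterministic upper lane
    std::memcpy(r.lanes, src, 3 * sizeof(float)); // read exactly 12 bytes
    return r;
}
```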

Member Author

CC. @CarolEidt for input here

Contributor

It would be nice to be able to handle this in a "generic" way - that is, using a CAST from one type to another that does the appropriate widening. However, I don't think the current handling of casts is "generic" in a good way. So, I'd probably vote for carrying the conversion through to Lowering and codegen.

Contributor

that is, using a CAST from one type to another that does the appropriate widening.

Such a cast shouldn't be needed, much like we don't need a cast from short to int after a short indir. A SIMD12 indir really produces a SIMD16 value.

Member Author

I'm going to leave this to a future PR for now. Vector2/3 to Vector128<float> is going to be left as a managed implementation which explicitly zeroes the unused upper elements.

In particular I want to get some more testing and samples done for this and don't want to block the other changes.

break;
}

case NI_Vector256_AsVector:
Member Author

AsVector is also fairly simple.

If we don't support AVX, we just go to the software fallback.
If we do support AVX but not AVX2 (meaning Vector<T> is 16 bytes), then we just need GetLower (which is generally a nop, as we are either in a register or in a memory location that is correctly aligned and at least 16 bytes).

}

case NI_Vector256_AsVector:
case NI_Vector256_AsVector256:
Member Author

AsVector256 (which is Vector<T> to Vector256<T>) is similar.

If we don't support AVX, we just go to the software fallback.
If we support AVX2 (meaning Vector<T> is 32 bytes), then it is a nop.
If we only support AVX, we just set the upper 128 bits to zero via ToVector256 (likewise, we probably want an Unsafe API so users can get non-deterministic bits, as we've done for other APIs).
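The AVX-only path can be sketched like this (illustrative C++, not the actual codegen; `Simd32` and `to_vector256` are made-up names): the Vector<T> payload is the low 128 bits, and the ToVector256 semantics require the upper 128 bits to be zeroed. An "Unsafe" variant would simply skip the zeroing, leaving those bits undefined.

```cpp
#include <cassert>
#include <cstring>

struct alignas(32) Simd32 { float lanes[8]; };

// Zero-extend a 128-bit value into a 256-bit one: copy the low 128 bits
// and make the upper 128 bits deterministic by clearing them.
Simd32 to_vector256(const float lower[4])
{
    Simd32 r;
    std::memset(r.lanes, 0, sizeof(r.lanes));       // zero the upper 128 bits
    std::memcpy(r.lanes, lower, 4 * sizeof(float)); // copy the low 128 bits
    return r;
}
```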

Contributor

users can get non-deterministic bits

What does that mean? Who needs non-deterministic bits? This isn't a random number generator.

Member Author

The intent is to provide an API which provides fast conversion and requires the user to manage the upper bits themselves.

Similarly to how we have both ToVector256 and ToVector256Unsafe: the former explicitly zeros the upper 128 bits, while the latter is an effective nop that just allows access to the existing upper 128 bits of the register.

@danmoseley danmoseley added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 3, 2020
@tannergooding tannergooding force-pushed the as-vector branch 2 times, most recently from e0cda97 to 4dc66de Compare January 6, 2020 21:21
@tannergooding
Member Author

CC. @dotnet/jit-contrib

@tannergooding
Member Author

Just rebased onto current head. Working on resolving feedback given so far.

@tannergooding
Member Author

The current changes result in the following diff (slimmed down to just the modified tests; there were no changes to other tests or framework/runtime assemblies):

Found 8 files with textual diffs.

Summary of Code Size diffs:
(Lower is better)

Total bytes of diff: -2456 (-0.14% of base)
    diff is an improvement.

Top file improvements (bytes):
        -672 : Vector128_1_r.dasm (-0.14% of base)
        -660 : Vector256_1_r.dasm (-0.12% of base)
        -624 : Vector128_1_ro.dasm (-0.19% of base)
        -500 : Vector256_1_ro.dasm (-0.13% of base)

4 total files with Code Size differences (4 improved, 0 regressed), 0 unchanged.

Top method regressions (bytes):
          16 (10.53% of base) : Vector128_1_ro.dasm - VectorAs__AsVector4Single:RunBasicScenario():this

Top method improvements (bytes):
         -66 (-17.19% of base) : Vector128_1_r.dasm - VectorAs__AsVector4Single:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorByte:RunBasicScenario():this
         -66 (-16.18% of base) : Vector256_1_r.dasm - VectorAs__AsVectorDouble:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt16:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt32:RunBasicScenario():this
         -66 (-16.38% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt64:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorSByte:RunBasicScenario():this
         -66 (-16.18% of base) : Vector256_1_r.dasm - VectorAs__AsVectorSingle:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt16:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt32:RunBasicScenario():this
         -66 (-16.38% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt64:RunBasicScenario():this
         -64 (-24.62% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorByte:RunBasicScenario():this
         -64 (-25.40% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorDouble:RunBasicScenario():this
         -64 (-24.52% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt16:RunBasicScenario():this
         -64 (-24.90% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt32:RunBasicScenario():this
         -64 (-24.81% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt64:RunBasicScenario():this
         -64 (-24.52% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorSByte:RunBasicScenario():this
         -64 (-25.30% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorSingle:RunBasicScenario():this
         -64 (-24.62% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt16:RunBasicScenario():this
         -64 (-24.90% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt32:RunBasicScenario():this

Top method regressions (percentages):
          16 (10.53% of base) : Vector128_1_ro.dasm - VectorAs__AsVector4Single:RunBasicScenario():this

Top method improvements (percentages):
         -64 (-25.40% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorDouble:RunBasicScenario():this
         -64 (-25.30% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorSingle:RunBasicScenario():this
         -64 (-24.90% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt32:RunBasicScenario():this
         -64 (-24.90% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt32:RunBasicScenario():this
         -64 (-24.81% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt64:RunBasicScenario():this
         -64 (-24.81% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt64:RunBasicScenario():this
         -64 (-24.62% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorByte:RunBasicScenario():this
         -64 (-24.62% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorUInt16:RunBasicScenario():this
         -64 (-24.52% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorInt16:RunBasicScenario():this
         -64 (-24.52% of base) : Vector128_1_ro.dasm - VectorAs__AsVectorSByte:RunBasicScenario():this
         -66 (-17.19% of base) : Vector128_1_r.dasm - VectorAs__AsVector4Single:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorByte:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt16:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt32:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorSByte:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt16:RunBasicScenario():this
         -66 (-16.46% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt32:RunBasicScenario():this
         -66 (-16.38% of base) : Vector256_1_r.dasm - VectorAs__AsVectorInt64:RunBasicScenario():this
         -66 (-16.38% of base) : Vector256_1_r.dasm - VectorAs__AsVectorUInt64:RunBasicScenario():this
         -66 (-16.18% of base) : Vector256_1_r.dasm - VectorAs__AsVectorDouble:RunBasicScenario():this

42 total methods with Code Size differences (41 improved, 1 regressed), 3634 unchanged.

The diffs are essentially all cases like the following:

-       vmovaps  qword ptr [rsp+A0H], xmm6
-       xor      rax, rax
-       mov      qword ptr [rsp+80H], rax
-       mov      qword ptr [rsp+88H], rax
-       mov      qword ptr [rsp+90H], rax
-       mov      qword ptr [rsp+98H], rax
+       vmovaps  qword ptr [rsp+70H], xmm6
+       vmovaps  qword ptr [rsp+60H], xmm7
+       vextractf128 ymm7, ymm6, 1
        call     VectorAs__AsVectorUInt64:ValidateResult(Vector`1,Vector128`1,String):this
-       vmovupd  ymm0, ymmword ptr[rsp+80H]
-       vmovupd  ymmword ptr[rsp+60H], ymm0
-       vmovupd  xmm6, xmmword ptr [rsp+60H]
-       vmovapd  xmmword ptr [rsp+50H], xmm6
-       vmovupd  ymm0, ymmword ptr[rsp+80H]
-       vmovupd  ymmword ptr[rsp+20H], ymm0
+       vinsertf128 ymm6, ymm6, ymm7, 1
+       vmovaps  ymm0, ymm6
+       vmovapd  xmmword ptr [rsp+20H], xmm0
+       vmovupd  ymmword ptr[rsp+30H], ymm6

@tannergooding tannergooding marked this pull request as ready for review January 22, 2020 21:07
@tannergooding tannergooding changed the title [WIP] Recognize the Vector128/256.AsVector* methods as intrinsic Recognize the Vector128/256.AsVector* methods as intrinsic Jan 22, 2020
@tannergooding
Copy link
Member Author

This should be ready for review now.

I did not treat the currently problematic cases (Vector2/3 to/from Vector128) as intrinsic for now. The other cases (Vector to/from Vector128/256 and Vector4 to/from Vector128) are handled as intrinsics.

Contributor

@CarolEidt left a comment

I think it's best to avoid the unreached asserts here, but I see that they're all over the place, even in places where there might be a reasonable default action. I'll submit a separate issue to review this.

{
// TYP_SIMD8 and TYP_SIMD12 currently only expose "safe" versions
// which zero the upper elements and so are implemented in managed.
unreached();
Contributor

This will expand to a noway_assert, I believe. I think it's best to avoid these where there's a valid default action to take, especially in a non-optimizing phase, since it will cause the compile to be retried with MinOpts. I think the better option for these cases is an assert such as assert(!"Unexpected intrinsic in impBaseIntrinsic"); followed by a break, which I believe will cause it to return nullptr, as for an unsupported or unrecognized intrinsic.
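The suggested pattern, sketched as a standalone C++ function (the name `importIntrinsicSketch` and the id/return values are made up; the actual JIT importer returns a tree node): the debug-only assert flags the unexpected case, but the default action still falls through to the "unrecognized intrinsic" result instead of forcing a MinOpts retry the way unreached() would.

```cpp
#include <cassert>

// Returns an "expansion" for a recognized intrinsic id, or -1 to signal
// "not recognized" so the caller falls back to the ordinary call path.
int importIntrinsicSketch(int intrinsicId)
{
    int retNode = -1;
    switch (intrinsicId)
    {
        case 1: // a recognized intrinsic is expanded here
            retNode = 100;
            break;

        default:
        {
            // In a checked JIT build this would be:
            //   assert(!"Unexpected intrinsic in impBaseIntrinsic");
            // followed by break, leaving retNode as "unrecognized" rather
            // than hitting unreached()/noway_assert and retrying in MinOpts.
            break;
        }
    }
    return retNode;
}
```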

Member Author

Seems reasonable. I'm going to go ahead and merge the PR as-is and then follow it up with a PR that adjusts the unreached() calls I can find in the hwintrinsic* files.


5 participants