
Implement the last of the approved cross platform hardware intrinsics, except shuffle #63414

Merged: 19 commits merged into dotnet:main on Jan 13, 2022

Conversation

tannergooding (Member):

This resolves #20510 and everything except Shuffle from #63331

Shuffle is going to be a separate PR as it's:

  1. Not immediately needed for most of the BCL logic
  2. Quite a bit more complicated to enable, as the logic needs to recognize and specially handle various vector constants for each platform (see the sketch below)
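
For context, a hedged illustration of the kind of call Shuffle enables and why the constants matter. The API shape here follows the approved proposal in #63331; it is not part of this PR, and the exact surface could differ:

    using System.Runtime.Intrinsics;

    // Reverses the 16 bytes of `source`. For good codegen the JIT must
    // recognize `indices` as a vector constant so each backend can bake it
    // into a single instruction (e.g. tbl on Arm64, pshufb on x64).
    static Vector128<byte> ReverseBytes(Vector128<byte> source)
    {
        Vector128<byte> indices = Vector128.Create(
            (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
        return Vector128.Shuffle(source, indices);
    }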

@dotnet-issue-labeler (bot):

Note regarding the new-api-needs-documentation label:

This serves as a reminder: when your PR modifies a ref *.cs file and adds or modifies public APIs, please make sure the API implementation in the src *.cs file is documented with triple-slash comments, so the PR reviewers can sign off on that change.

@ghost commented Jan 5, 2022:

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

Author: tannergooding
Assignees: -
Labels: area-System.Runtime.Intrinsics, new-api-needs-documentation
Milestone: -

@tannergooding (Member, Author):

This is ready for review.

@EgorBo mentioned this pull request on Jan 9, 2022.
[Several resolved review threads on src/coreclr/jit/gentree.cpp]
{
    if (!varTypeIsLong(simdBaseType))
    {
        op1 = gtNewSimdHWIntrinsicNode(TYP_SIMD8, op1, NI_AdvSimd_Arm64_AddAcross, simdBaseJitType,
Contributor:
Same comment regarding using AddAcross versus many AddPairwise

tannergooding (Member, Author):
This is roughly the same scenario as #63414 (comment). We need a proper sum here that fits in 32 bits.

There are some other potential optimization opportunities here, but they would require handling in lowering and aren't "critical" to getting the feature out, so it's likely better to handle them in a separate PR:

  • These would involve specially handling the input and whether or not it's an intrinsic that is known to produce an all-bits-set vs. no-bits-set per-element result. In that scenario, we can elide the shift in favor of masking the right bits directly and then use two AddPairwise, which may be better (see the sketch below).
  • The same applies to certain codegen for x86/x64 (but not ExtractMostSignificantBits, since that's just MoveMask there), where we could optimize to things like Blend instead.
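
For illustration, a minimal C# sketch of the two shapes being compared. These are Arm64-only intrinsics, and the input is widened to ushort here so the sum fits without per-lane overflow:

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;

    // One reduction instruction (addv) across all eight lanes.
    static ushort SumAcross(Vector128<ushort> v) =>
        AdvSimd.Arm64.AddAcross(v).ToScalar();

    // The equivalent log2(n) chain of pairwise adds (addp).
    static ushort SumPairwise(Vector128<ushort> v)
    {
        Vector128<ushort> t = AdvSimd.Arm64.AddPairwise(v, v); // 8 lanes -> 4 adjacent sums (duplicated)
        t = AdvSimd.Arm64.AddPairwise(t, t);                   // -> 2
        t = AdvSimd.Arm64.AddPairwise(t, t);                   // -> 1
        return t.ToScalar();
    }

Which form wins depends on the surrounding code, which is why the reply above defers the AddPairwise variants to lowering in a follow-up PR.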

op2 = gtNewOperNode(GT_ADD, op1->TypeGet(), op1, op2);
}

retNode = gtNewSimdHWIntrinsicNode(retType, op2, op1, NI_AdvSimd_Store, simdBaseJitType, simdSize);
Contributor:
I think this could be implemented more efficiently if we were using the "Store SIMD&FP register (register offset)" form of the str instruction:

str Dt, [op1Reg, op2Reg, LSL #Log2(genTypeSize(simdBaseType))]
str Qt, [op1Reg, op2Reg, LSL #Log2(genTypeSize(simdBaseType))]

Should we introduce an intrinsic under AdvSimd that exposes this instruction?

Or instead introduce an optimization in Lowering that "contains" such an address expression and handle the case in CodeGen?

tannergooding (Member, Author):
This should be a general thing done as part of Lowering so that any SIMD store can be optimized accordingly, just as we already do for x64.
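
A hedged sketch of what that lowering optimization would enable, assuming the StoreUnsafe overload with an element offset from the new cross-platform API surface:

    using System.Runtime.Intrinsics;

    static unsafe void StoreAt(Vector128<int> value, int* basePtr, nuint index)
    {
        // The address is basePtr + index * sizeof(int). With containment in
        // Lowering, this can become a single
        //     str q0, [x0, x1, lsl #2]
        // instead of a separate address computation followed by a plain str.
        value.StoreUnsafe(ref *basePtr, index);
    }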

[Resolved review thread on src/coreclr/jit/hwintrinsicarm64.cpp]
echesakov (Contributor) left a review:
The JIT changes LGTM.


for (int index = 0; index < Vector<byte>.Count; index++)
{
var element = Scalar<byte>.ShiftLeft(value.GetElementUnsafe(index), shiftCount);
Member:
Nit: can we avoid using `var` here and elsewhere? This is a good example of where I don't actually know what type `element` is.

tannergooding (Member, Author):
I can do a separate PR to fix it up for all the vector code. It has (and has for a long time) used `var` fairly extensively to help make copying the algorithms around easier.

The "type" really doesn't matter; the action being done does, and that's generally guaranteed by the generic context (e.g. `Scalar<byte>` means it's operating on byte).

Member:

> I can do a separate PR to fix it up for all the vector code

Thanks

return result;
}

/// <summary>Shifts each element of a vector right by the specified amount.</summary>
Member:
The docs for ShiftRightArithmetic and ShiftRightLogical are identical. Is that intentional?

tannergooding (Member, Author):
Partially. I don't have a good alternative wording to differentiate and explain the differences between arithmetic (signed) and logical (unsigned) shifting in the space of a doc summary.
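
The behavioral difference the summaries need to capture, shown in a few lines of C#:

    sbyte signedValue = -128;                // 0b1000_0000
    int arithmetic = signedValue >> 1;       // -64: the sign bit is replicated
    byte unsignedValue = 128;                // 0b1000_0000
    int logical = unsignedValue >> 1;        // 64: vacated bits are zero-filled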

Member:

> I don't have a good alternative wording to differentiate and explain the differences between arithmetic (signed) and logical (unsigned) shifting in the space of a doc summary.

"Arithmetic shifts each element" and "Logical shifts each element"?
or
"Shifts (arithmetic) each element" and "Shifts (logical) each element"?
or
"Shifts (signed) each element" and "Shifts (unsigned) each element"?
or something along those lines? It'd be nice to differentiate them.

tannergooding (Member, Author):
I think "Shifts (signed)" is probably my favorite out of those. If you don't have any pushback, I'll do that in the follow-up PR I mentioned above (to avoid churning CI just for a docs change and to unblock the downstream work dependent on this PR).

}
else if (typeof(T) == typeof(nint))
{
    if (Environment.Is64BitProcess)
Member:
When are we choosing to use `#if TARGET_64BIT` vs. `if (Environment.Is64BitProcess)` for this code in corelib?

tannergooding (Member, Author):

The code in here largely uses Environment because it wasn't always in corelib, and it continues using it for consistency.

This is fallback logic and isn't perf-critical (it basically only runs on x86 Unix and Arm32), so I don't think it's super important to minimize the IL even more.

Member:

> This is fallback logic and isn't perf-critical (it basically only runs on x86 Unix and Arm32), so I don't think it's super important to minimize the IL even more.

That leaves me wondering:

  1. We're using things like Unsafe.SkipInit in code that's not perf critical?
  2. Does this fallback logic get trimmed out on platforms where it's not used?

tannergooding (Member, Author):

It still gets used in indirect calls (delegates/fnptrs/reflection), so we still need an implementation. This just happens to be the easiest way to do it (mostly because the backing fields are ulong, not T).

These are all generics over structs, so all irrelevant code paths are trimmed for each scenario. That is, byte won't have paths for sbyte, short, ushort, int, uint, long, ulong, float, double, nint, or nuint. Likewise, Environment.Is64BitProcess is constant-folded by both the JIT and AOT, so there isn't an actual "cost" here (other than an extra branch preserved in the IL).
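
A minimal sketch of the pattern being described (illustrative, not corelib's exact code): both branches remain in the IL, but the JIT folds Environment.Is64BitProcess to a constant and drops the dead one from the generated code.

    using System;

    static nint ShiftLeftNInt(nint value, int shiftCount)
    {
        if (Environment.Is64BitProcess)
        {
            return (nint)((long)value << shiftCount); // only branch kept on 64-bit
        }
        return (nint)((int)value << shiftCount);      // only branch kept on 32-bit
    }

With #if TARGET_64BIT, by contrast, the unused branch would be absent from the IL itself, which is what the trimming question below is probing.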

Member:

> These are all generics over structs, so all irrelevant code paths are trimmed for each scenario

You're talking about the asm. I'm wondering about the IL and the linker/trimmer, i.e. will System.Private.CoreLib.dll be identical in size whether the ifdef is used or the Is64BitProcess check is used (which would require the linker to recognize it and eliminate the dead code based on it)?

tannergooding (Member, Author):

That I don't know. I know the linker has some understanding of Is64BitProcess, but I'm not sure under what contexts it's used vs. not.

Comment on lines +2911 to +2915
long* value = null;

try
{
    value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16);
Member:
Any reason not to condense it to:

Suggested change:

    -long* value = null;
    -
    -try
    -{
    -    value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16);
    +long* value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16);
    +
    +try
    +{

?

tannergooding (Member, Author):
I have a habit of putting the thing that is impacted by the catch inside the try, both to differentiate what failed and to avoid issues with SkipLocalsInit, etc.
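
A completed sketch of the pattern being defended, assuming the usual finally-based cleanup (NativeMemory.AlignedFree is a no-op for null, so the "never allocated" case is safe):

    using System.Runtime.InteropServices;

    long* value = null;
    try
    {
        value = (long*)NativeMemory.AlignedAlloc(byteCount: 16, alignment: 16);
        // ... exercise the 16-byte-aligned buffer ...
    }
    finally
    {
        // Frees the buffer if the alloc succeeded; does nothing if it threw.
        NativeMemory.AlignedFree(value);
    }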

stephentoub (Member) left a review:
I skimmed the .cs files. Other than my comments, LGTM.

@tannergooding (Member, Author) commented Jan 12, 2022:

Logged #63704 to track the managed code cleanup. I plan on doing that basically as soon as this is merged, so it won't stay on some "indefinite" backlog.

@tannergooding merged commit cfe5e98 into dotnet:main on Jan 13, 2022.
@EgorBo (Member) commented Jan 25, 2022:

Improvement on Windows-x64 dotnet/perf-autofiling-issues#2944

@dotnet locked this conversation as resolved and limited it to collaborators on Feb 25, 2022.

Successfully merging this pull request may close these issues.

API Proposal: ShiftLeft, ShiftRight for Vector<T>
7 participants