Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize library code using arm64 intrinsics #33308

Closed
BruceForstall opened this issue Mar 6, 2020 · 37 comments
Closed

Optimize library code using arm64 intrinsics #33308

BruceForstall opened this issue Mar 6, 2020 · 37 comments
Labels
arch-arm64 area-Meta Epic Groups multiple user stories. Can be grouped under a theme. help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@BruceForstall
Copy link
Member

BruceForstall commented Mar 6, 2020

The following classes/functions in the libraries have Intel x86/x64 intrinsics usage. These are where _ISA_.IsSupported() is called. This information was collected manually and might not be complete. Some of these function names represent many overloads. There are some vectorized helper methods not shown here -- where a function calls IsSupported and then calls a specific helper function to do the actual work, such as for SSE2 or AVX2 specifically. There are other cases where Vector<T> is used, but arm64 already supports that (it should be verified that the arm64 Vector<T> code is complete and performant).

When each of these has added an arm64-specific intrinsics optimization, it should be "checked off".

The sections below are ordered in the presumed priority order that they should be implemented in. (There is no assumed priority order for the individual functions in each section.)

It is expected that System.Collections.BitArray, System.Numerics, and System.SpanHelpers will be "arm64 intrinsi-fied" for .NET 5. If possible, System.Buffers and System.Text will as well, but that is not considered required.

System.Collections.BitArray #33309

  • System.Collections.BitArray - constructor
  • System.Collections.BitArray.And()
  • System.Collections.BitArray.Or()
  • System.Collections.BitArray.Xor()
  • System.Collections.BitArray.Not()
  • System.Collections.BitArray.CopyTo()

System.Runtime.Intrinsics #33496

Vector64

  • As<T, U>()
  • AsInt64()
  • AsUInt64()
  • AsDouble()
  • CreateScalarUnsafe(int value);
  • CreateScalarUnsafe(uint value);
  • CreateScalarUnsafe(float value);
  • CreateScalarUnsafe(byte value);
  • CreateScalarUnsafe(sbyte value);
  • CreateScalarUnsafe(short value);
  • CreateScalarUnsafe(ushort value);
  • CreateScalar(uint)
  • CreateScalar(float)
  • CreateScalar(sbyte)
  • CreateScalar(ushort)
  • CreateScalar(short)
  • CreateScalar(byte)
  • CreateScalar(int)
  • Create(sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte)
  • Create(byte, byte, byte, byte, byte, byte, byte, byte)
  • Create(ushort, ushort, ushort, ushort)
  • Create(short, short, short, short)
  • Create(float, float)
  • Create(int, int)
  • Create(ulong)
  • Create(uint)
  • Create(uint, uint)
  • Create(float)
  • Create(sbyte)
  • Create(long)
  • Create(int)
  • Create(short)
  • Create(double)
  • Create(byte)
  • Create(ushort)
  • GetElement(int index)
  • ToScalar()
  • ToVector128()
  • ToVector128Unsafe()
  • WithElement(Vector64, int, T)

Vector128

  • As<T, U>()
  • AsVector()
  • AsVector4(Vector128)
  • AsVector128(Vector)
  • AsVector128(Vector4)
  • CreateScalarUnsafe(int value);
  • CreateScalarUnsafe(uint value);
  • CreateScalarUnsafe(float value);
  • CreateScalarUnsafe(long value);
  • CreateScalarUnsafe(ulong value);
  • CreateScalarUnsafe(double value);
  • CreateScalarUnsafe(byte value);
  • CreateScalarUnsafe(sbyte value);
  • CreateScalarUnsafe(short value);
  • CreateScalarUnsafe(ushort value);
  • CreateScalar(ulong)
  • CreateScalar(uint)
  • CreateScalar(ushort)
  • CreateScalar(sbyte)
  • CreateScalar(float)
  • CreateScalar(int)
  • CreateScalar(short)
  • CreateScalar(double)
  • CreateScalar(byte)
  • CreateScalar(long)
  • Create(sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte, sbyte)
  • Create(byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte, byte)
  • Create(ushort, ushort, ushort, ushort, ushort, ushort, ushort, ushort)
  • Create(short, short, short, short, short, short, short, short)
  • Create(uint, uint, uint, uint)
  • Create(float, float, float, float)
  • Create(int, int, int, int)
  • Create(ulong, ulong)
  • Create(Vector64, Vector64)
  • Create(Vector64, Vector64)
  • Create(Vector64, Vector64)
  • Create(Vector64, Vector64)
  • Create(Vector64, Vector64)
  • Create(Vector64, Vector64)
  • Create(Vector64, Vector64)
  • Create(Vector64, Vector64)
  • Create(byte)
  • Create(double)
  • Create(short)
  • Create(int)
  • Create(long)
  • Create(Vector64, Vector64)
  • Create(float)
  • Create(sbyte)
  • Create(uint)
  • Create(ulong)
  • Create(double, double)
  • Create(long, long)
  • Create(Vector64, Vector64)
  • Create(ushort)
  • GetElement(int index)
  • GetLower()
  • GetUpper()
  • WithElement(Vector128, int, T)
  • WithLower(Vector64)
  • WithUpper(Vector64)
  • ToScalar()

Vector256

  • Software fallback

System.Numerics

System.Numerics.BitOperations #33495

  • System.Numerics.BitOperations.LeadingZeroCount()
  • System.Numerics.BitOperations.Log2()
  • System.Numerics.BitOperations.PopCount()
  • System.Numerics.BitOperations.TrailingZeroCount()

System.Numerics.Matrix4x4 #33565

  • System.Numerics.Matrix4x4.Transpose()
  • System.Numerics.Matrix4x4.Lerp()
  • System.Numerics.Matrix4x4.operator-()
  • System.Numerics.Matrix4x4.operator+()
  • System.Numerics.Matrix4x4.operator*()
  • System.Numerics.Matrix4x4.operator==()
  • System.Numerics.Matrix4x4.operator!=()

System.SpanHelpers #33707

  • System.SpanHelpers.IndexOf(byte)
  • System.SpanHelpers.IndexOf(char)
  • System.SpanHelpers.IndexOfAny(byte)
    [ ] System.SpanHelpers.SequenceCompareTo(byte) (SIMD vector implementation is fast enough)
    [ ] System.SpanHelpers.SequenceEqual(byte) (SIMD vector implementation is fast enough)
    [ ] System.SpanHelpers.LocateFirstFoundByte() (Only used by SIMD version of IndexOf and IndexOfAny which are already optimized by ARM64 intrinsics)

System.Buffers #35033

(Not completed in 5.0.0; moved to 6.0.0)

  • System.Buffers.Text.Base64.DecodeFromUtf8()
  • System.Buffers.Text.Base64.EncodeToUtf8()

System.Text

System.Text.ASCIIUtility #35034

(Not completed in 5.0.0; #41292 contains the items moved to 6.0.0)

  • System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiByte()
  • System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiChar() - (Not completed in 5.0.0; moved to 6.0.0) - PR Get index of first non ascii char #39507
  • System.Text.ASCIIUtility.NarrowFourUtf16CharsToAsciiAndWriteToBuffer()
  • System.Text.ASCIIUtility.NarrowUtf16ToAscii() - (Not completed in 5.0.0; moved to 6.0.0) - PR Narrow utf16 to ascii #39509
  • System.Text.ASCIIUtility.WidenAsciiToUtf16()
  • System.Text.ASCIIUtility.WidenFourAsciiBytesToUtf16AndWriteToBuffer()
  • System.Text.ASCIIUtility.CountNumberOfLeadingAsciiBytesFromUInt32WithSomeNonAsciiData()

System.Text.Unicode #35035

  • System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar()
  • System.Text.Unicode.Utf8Utility.TranscodeToUtf8()
  • System.Text.Unicode.Utf8Utility.GetPointerToFirstInvalidByte()

System.Text.Encodings.Web #35036

  • System.Text.Encodings.Web.DefaultJavaScriptEncoder.FindFirstCharacterToEncodeUtf8()
  • System.Text.Encodings.Web.DefaultJavaScriptEncoderBasicLatin.FindFirstCharacterToEncode()
  • System.Text.Encodings.Web.DefaultJavaScriptEncoderBasicLatin.FindFirstCharacterToEncodeUtf8()
  • System.Text.Encodings.Web.TextEncoder.FindFirstCharacterToEncodeUtf8()
  • System.Text.Encodings.Web.UnsafeRelaxedJavaScriptEncoder.WillEncode()
  • System.Text.Encodings.Web.UnsafeRelaxedJavaScriptEncoder.FindFirstCharacterToEncodeUtf8()
@jkotas
Copy link
Member

jkotas commented Mar 6, 2020

System.Text System.Text.ASCIIUtility, System.Text.Unicode and System.SpanHelpers are the most important ones for Web workloads.

@danmoseley danmoseley added the help wanted [up-for-grabs] Good issue for external contributors label Mar 6, 2020
@BruceForstall
Copy link
Member Author

@jkotas It's my understanding that ASP.NET is currently not running benchmarks on arm64. Hopefully that will change, then we'd be able to see benefits from optimizing these.

@BruceForstall
Copy link
Member Author

Also, BitArray is chosen to be first to implement because it's simple, has benchmarks defined, and can easily be used as a proof-of-concept of the basics of the intrinsics implementation -- not because it's the most important class.

@tannergooding
Copy link
Member

CC. @CarolEidt, @echesakovMSFT, @GrabYourPitchforks, @TamarChristinaArm

@Gnbrkm41
Copy link
Contributor

Gnbrkm41 commented Mar 7, 2020

Would love to help this! Though, I remember majority of the ARM intrinsics being unavailable last time I looked up (only available via experimental NuGet package). Has this been addressed? I have seen a few API review sessions where ARM intrinsics are discussed but I don't know if they are implemented.

@Gnbrkm41
Copy link
Contributor

Gnbrkm41 commented Mar 7, 2020

One concern is that I can't quite run tests and benchmarks locally since I do not own an ARM machine (well, I do - but it's an RPi 3B+). I know those exist as part of CI/CD but are the perf numbers available to non-MSFT people?

@BruceForstall
Copy link
Member Author

I have seen a few API review sessions where ARM intrinsics are discussed but I don't know if they are implemented.

The implementation is in progress, so there will be feature gaps.

One concern is that I can't quite run tests and benchmarks locally since I do not own an ARM machine (well, I do - but it's an RPi 3B+). I know those exist as part of CI/CD but are the perf numbers available to non-MSFT people?

That's a bigger problem. The CI will do feature testing on ARM64, but there is currently no perf testing available to non-MSFT (and minimal available internal). You probably don't want to depend on the CI to do all functional testing for you. I think it's possible to use the RPi for this if you install, say, Ubuntu 64-bit.

@CarolEidt
Copy link
Contributor

I think it's possible to use the RPi for this if you install, say, Ubuntu 64-bit.

I have successfully run perf tests on a RPi 4 with Ubuntu 64-bit; I assume it would also work on the RPi 3B+.

@BruceForstall
Copy link
Member Author

cc @jeffschwMSFT

@tannergooding
Copy link
Member

System.Numerics.Intrinsics #33496, this (and the namespaces under it) should be System.Runtime.Intrinsics.

@Gnbrkm41
Copy link
Contributor

I think we also could add System.Buffers.Binary.BinaryPrimitives.ReverseEndianness in the list as well. Trivial search on CoreCLR sources do not seem to suggest that they are intrinsic; however I may as well be wrong.

@tannergooding
Copy link
Member

For System.Runtime.Intrinsics (#33496), it is correct that Vector256<T> is unsupported on ARM/ARM64. The functions already have a software fallback however, so no work is needed there.

However, ARM/ARM64 do support Vector64<T> (unlike x86) and functions which are special cased on Vector128<T> should be mirrored down. Likewise, there are functions like Vector128.GetLower which should be specialized (either in the JIT or in software).

@EgorBo
Copy link
Member

EgorBo commented Mar 13, 2020

I attempted to use Arm64 intrinsics to optimize System.Numerics.BitOperations.LeadingZeroCount but it was too early :-) https://github.com/dotnet/coreclr/pull/26815/files (tested on a real device)

@benaadams
Copy link
Member

Also SpanHelpers.SequenceEqual<byte> after #32371

@BruceForstall
Copy link
Member Author

At this time, we don't believe there will be enough time (or resources) to implement the following optimizations for .NET 5:

System.Numerics.Matrix4x4 #33565
System.Text.ASCIIUtility #35034
System.Text.Unicode #35035
System.Text.Encodings.Web #35036

As @jkotas writes above, this will leave some potential web workload performance improvements on the table.

cc @tannergooding @danmosemsft

@tannergooding
Copy link
Member

There is a meeting Monday morning for me to onboard 3-4 others with porting these remaining types. #33565 is one that I'll be using to help walk them through the process of porting the other code.

What is the current deadline for getting these changes in by?

CC. @jeffhandley

@BruceForstall
Copy link
Member Author

@tannergooding That's great to hear. The deadline is whatever your team chooses as the deadline, for how you manage 5.0 work. The CLR CodeGen team doesn't expect to have enough time/resources to these ourselves for 5.0 (we chose to do lots of this libraries work initially, to validate the intrinsics work and seed the optimizations), but certainly other teams/people can do them instead.

@danmoseley
Copy link
Member

What is the current deadline for getting these changes in by?

The platform teams (us) are aiming to be feature complete by Preview 8 snap which is July 15 (internal link)

@jeffhandley
Copy link
Member

@BruceForstall - we'll keep you updated on our planning and progress after we kick off the work next Monday.

@danmoseley
Copy link
Member

@tannergooding what is our test strategy for all these -- presumably we're just relying on our regular test bed and having enough hardware variation.

In particular, how do we get coverage on the software fallback paths? In the past, we've said that the ARM machines cover this path for us.

@BruceForstall
Copy link
Member Author

@danmosemsft We have two AzDO pipelines for stressing the ISAs, which includes disabling hardware intrinsics:

runtime-coreclr jitstress-isas-arm: https://dev.azure.com/dnceng/public/_build?definitionId=665
runtime-coreclr jitstress-isas-x86: https://dev.azure.com/dnceng/public/_build?definitionId=664

These are defined using the following modes (in eng\pipelines\common\templates\runtimes
\run-test-job.yml, with the tags defined in src\coreclr\tests\testenvironment.proj):

${{ if in(parameters.testGroup, 'jitstress-isas-arm') }}:
  scenarios:
  - jitstress_isas_incompletehwintrinsic
  - jitstress_isas_nohwintrinsic
  - jitstress_isas_nohwintrinsic_nosimd
  - jitstress_isas_nosimd
${{ if in(parameters.testGroup, 'jitstress-isas-x86') }}:
  scenarios:
  - jitstress_isas_incompletehwintrinsic
  - jitstress_isas_nohwintrinsic
  - jitstress_isas_nohwintrinsic_nosimd
  - jitstress_isas_nosimd
  - jitstress_isas_x86_noaes
  - jitstress_isas_x86_noavx
  - jitstress_isas_x86_noavx2
  - jitstress_isas_x86_nobmi1
  - jitstress_isas_x86_nobmi2
  - jitstress_isas_x86_nofma
  - jitstress_isas_x86_nohwintrinsic
  - jitstress_isas_x86_nolzcnt
  - jitstress_isas_x86_nopclmulqdq
  - jitstress_isas_x86_nopopcnt
  - jitstress_isas_x86_nosse
  - jitstress_isas_x86_nosse2
  - jitstress_isas_x86_nosse3
  - jitstress_isas_x86_nosse3_4
  - jitstress_isas_x86_nosse41
  - jitstress_isas_x86_nosse42
  - jitstress_isas_x86_nossse3

@tannergooding can comment on whether that is still sufficient for x86/x64, arm, as well as scenarios like R2R. Note that ARM32 doesn't support hardware intrinsics, so requires the fallback path.

@danmoseley
Copy link
Member

@jeffhandley is work planned for 5.0 complete? If so please let's close this now.

@kunalspathak
Copy link
Member

@danmosemsft - There are still some APIs under System.Text.ASCIIUtility and System.Buffers that are not optimized. You can check the status of those methods in the PR description.

@jeffhandley
Copy link
Member

@danmosemsft I'm aggregating the status of these and will make a decision today whether the remaining items should be moved to 6.0.

@benaadams
Copy link
Member

JsonReaderHelper.IndexOfOrLessThan is available with Arm intrinsics #41097 (though discussion about packaging for wasm; which could always be dropped out)

@jeffhandley
Copy link
Member

Here's the list of intrinsics efforts not merged into release/5.0. I recommend that we move them all to 6.0.

I will take the following actions:

  1. Update the text of this issue description for the Buffers and ASCIIUtility items to note that those will be completed in 6.0
  2. Move Optimize System.Buffers for arm64 using cross-platform intrinsics #35033 (Buffers) to the 6.0.0 milestone
  3. Comment on Optimize System.Text.ASCIIUtility using arm64 intrinsics #35034 (ASCIIUtility) which methods were not completed in 5.0.0, and then close that issue
  4. Create a new issue for the remaining ASCIIUtility methods

@BruceForstall
Copy link
Member Author

This all looks good to me. Thanks @jeffhandley for summarizing and closing out the 5.0 release.

And thanks everyone for the great work with all these arm64 improvements!

@danmoseley
Copy link
Member

Agreed. And I think we can please close this then.

And thanks everyone for the great work with all these arm64 improvements!

+100 ! and appreciation for @BruceForstall for establishing clarity originally by creating this list.

@jeffhandley
Copy link
Member

Aforementioned actions taken. #41292 will track the remaining ASCIIUtility methods in 6.0.0 and #35033 was moved to 6.0.0.

Hardware Intrinsics automation moved this from To do arm64 to Done Aug 24, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-Meta Epic Groups multiple user stories. Can be grouped under a theme. help wanted [up-for-grabs] Good issue for external contributors
Projects
Development

No branches or pull requests