
Optimize System.Buffers for arm64 using cross-platform intrinsics #35033

Closed
1 of 2 tasks
Tracked by #64598 ...
BruceForstall opened this issue Apr 16, 2020 · 35 comments
Labels
arch-arm64 area-System.Buffers Cost:M Work that requires one engineer up to 2 weeks Priority:2 Work that is important, but not critical for the release
Milestone

Comments

@BruceForstall
Member

BruceForstall commented Apr 16, 2020

This item tracks the conversion of the System.Buffers code to use arm64 intrinsics (a usage sketch of the two entry points follows the list).

  • System.Buffers.Text.Base64.DecodeFromUtf8()
  • System.Buffers.Text.Base64.EncodeToUtf8()
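
For reference, both entry points follow the same OperationStatus pattern. A minimal usage sketch (buffer handling simplified; variable names are illustrative):

    using System;
    using System.Buffers;
    using System.Buffers.Text;

    byte[] data = { 1, 2, 3, 4, 5 };

    // Encode: size the destination with the helper, then check the status.
    byte[] utf8 = new byte[Base64.GetMaxEncodedToUtf8Length(data.Length)];
    OperationStatus enc = Base64.EncodeToUtf8(data, utf8, out int encConsumed, out int encWritten);

    // Decode the bytes we just produced.
    byte[] decoded = new byte[Base64.GetMaxDecodedFromUtf8Length(encWritten)];
    OperationStatus dec = Base64.DecodeFromUtf8(utf8.AsSpan(0, encWritten), decoded,
                                                out int decConsumed, out int decWritten);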

Update for .NET 7
Now that we have cross-platform intrinsics APIs (#49397), these optimizations should be completed using those APIs instead of adding ARM64-specific intrinsics code paths. The effort could optionally include first measuring performance of these methods with the ARM64-specific intrinsics in place, and then measuring the performance of these methods with the cross-platform intrinsics.

Example use of the new cross-platform intrinsics: #63722
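
For illustration, a minimal sketch of that style, where the same Vector128 code JITs to AdvSimd instructions on arm64 and SSE instructions on x64 with no per-architecture branches (the helper name is made up):

    using System.Runtime.Intrinsics;

    static Vector128<byte> MaskLowNibbles(Vector128<byte> value) =>
        // Splat 0x0F and AND; no Sse2.IsSupported / AdvSimd.IsSupported checks needed.
        Vector128.BitwiseAnd(value, Vector128.Create((byte)0x0F));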

Related: #33308

@BruceForstall BruceForstall added this to the 5.0 milestone Apr 16, 2020
@BruceForstall BruceForstall added this to To do general in Hardware Intrinsics via automation Apr 16, 2020
@ghost

ghost commented Apr 16, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 16, 2020
@BruceForstall BruceForstall moved this from To do general to To do arm64 in Hardware Intrinsics Apr 16, 2020
@john-h-k
Contributor

I can give this a shot

@john-h-k
Contributor

john-h-k commented Apr 24, 2020

These ARM64 intrinsics are missing and are needed to make the code "worthwhile":

Tracked by #32512: Done!

UnsignedSaturatingSubtract - vqsubq_u8 - UQSUB

Tracked by #33398:

UnsignedShiftRight - vshrq_n_u8 - USHR
UnsignedShiftLeft - vshlq_n_u8 - USHL

Tracked by #1277: Done!

TableLookup - vqtbl4q_u8 - TBL
TableLookupExtension - vqtbx4q_u8 - TBX

These helper methods could also be accelerated to help; they are tracked in #33496:

Vector128.Create(byte)

These intrinsics could be introduced to maximise perf (a scalar model of the LD4 semantics follows the list):

Load 3 Vector128<T> - vld3q_u8 - LD3
Store 3 Vector128<T> - vst3q_u8 - ST3
Load 4 Vector128<T> - vld4q_u8 - LD4
Store 4 Vector128<T> - vst4q_u8 - ST4
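
For context, vld4q_u8/LD4 is a structured load that de-interleaves 64 bytes into four vectors (element i of vector n comes from src[4 * i + n]), and vst4q_u8/ST4 is the inverse re-interleaving store. A scalar model of the LD4 semantics (the helper is hypothetical):

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical model of vld4q_u8: de-interleave 64 source bytes
    // into four Vector128<byte> values.
    static (Vector128<byte>, Vector128<byte>, Vector128<byte>, Vector128<byte>)
        Load4Deinterleave(ReadOnlySpan<byte> src)
    {
        byte[] a = new byte[16], b = new byte[16], c = new byte[16], d = new byte[16];
        for (int i = 0; i < 16; i++)
        {
            a[i] = src[4 * i + 0];
            b[i] = src[4 * i + 1];
            c[i] = src[4 * i + 2];
            d[i] = src[4 * i + 3];
        }
        return (Vector128.Create(a), Vector128.Create(b),
                Vector128.Create(c), Vector128.Create(d));
    }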

@john-h-k
Contributor

john-h-k@adeea36

Draft commit; this is most of what can be done until the necessary intrinsics are added.

@tannergooding
Member

Thanks @john-h-k, as per our conversation on Discord, I'll ping you once the necessary intrinsics are added so you can make further progress.

@BruceForstall
Member Author

@tannergooding @john-h-k The shift intrinsics went in with #36830. There have been many improvements to System.Runtime.Intrinsics optimization in #33496. Is this work now unblocked?

@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Jun 3, 2020
@tannergooding
Member

Yes, this should be unblocked now.

@BruceForstall
Member Author

@john-h-k Are you planning to try and finish this for .NET 5, now that it's unblocked?

@tannergooding
Member

If not, we'll have 3-4 people free to help tackle the remaining issues from #33308 on Monday.

CC. @jeffhandley

@john-h-k
Contributor

Yes, I should be. I just need to benchmark the two possible implementations I've got locally, either on my old Pi 2B or by getting a friend to run them on a newer device.

@BruceForstall
Member Author

@tannergooding @jeffhandley What's the plan for this for .NET 5?

@jeffhandley
Member

@tannergooding is planning to carry this one over the finish line ahead of RC1.

@jeffhandley jeffhandley modified the milestones: 5.0.0, 6.0.0 Aug 24, 2020
@jeffhandley
Member

We could not get this completed in time for 5.0.0; moved to 6.0.0.

@jeffhandley jeffhandley added the Priority:3 Work that is nice to have label Jan 14, 2021
@tannergooding tannergooding added the Cost:M Work that requires one engineer up to 2 weeks label Jan 15, 2021
@jeffhandley jeffhandley modified the milestones: 6.0.0, Future Apr 14, 2021
@jeffhandley jeffhandley changed the title Optimize System.Buffers using arm64 intrinsics Optimize System.Buffers for arm64 using cross-platform intrinsics Feb 23, 2022
@kunalspathak
Member

kunalspathak commented Mar 28, 2022

@a74nh - I am assigning this to you. Feel free to reassign to someone who will work on this.

Edit: For some reason, I am not able to assign this.

@gfoidl
Member

gfoidl commented Mar 28, 2022

Base64 relies heavily on shuffling, so shuffle intrinsics are a prerequisite for this issue.
Is shuffling an xplat intrinsic now? At least in .NET 7 Preview 2 I can't find it.

@tannergooding
Member

Is shuffling an xplat intrinsic now? At least in .NET 7 Preview 2 I can't find it.

I'm actively working on the support locally. Hoping to get it done soon but needing to balance that work against the Generic Math work.

@kunalspathak
Member

I'm actively working on the support locally.

#63331

@a74nh
Contributor

a74nh commented Mar 29, 2022

@a74nh - I am assigning this to you. Feel free to reassign to someone who will work on this.

Edit: For some reason, I am not able to assign this.

I can't assign this either. It should go to @SwapnilGaikwad instead of me.

@EgorBo
Member

EgorBo commented Mar 30, 2022

Base64 relies heavily on shuffling, so shuffle intrinsics are a prerequisite for this issue. Is shuffling an xplat intrinsic now? At least in .NET 7 Preview 2 I can't find it.

In #67192, for hex encoding, I introduced a temporary helper until #63331 lands 🙂

@gfoidl
Member

gfoidl commented Mar 30, 2022

IMO it's better to wait for #63331.
Otherwise the code has to be touched again to revert the helper change.

I'm also going to update https://github.com/gfoidl/Base64; there, fuzz tests are run on a scheduled CI trigger, so any fault should get caught early enough for .NET 7 RTM.

@kunalspathak
Member

Looks like Shuffle has landed in #68559, so this can get started? Note that, as called out in #67339, a lot of Base64Encode benchmarks like System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000) are up to 16 times slower. So we should fix this soon to make sure it lands in .NET 7.

@a74nh , @SwapnilGaikwad

@tannergooding
Member

Noting that only the Shuffle(x, mask) API has landed and it still has a few quirks, but it should be generally usable, especially with constant inputs.

The Shuffle(x, y, mask) API is going to need #68874 (which needs a couple of bits of PR feedback handled and a final review).
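
As a concrete example of the constant-mask form that has landed, a minimal sketch (the element-reversing mask is arbitrary, chosen only to show the shape):

    using System.Runtime.Intrinsics;

    // With a constant indices vector, Vector128.Shuffle can be emitted as a
    // single shuffle/table-lookup instruction on the target architecture.
    static Vector128<byte> ReverseElements(Vector128<byte> value) =>
        Vector128.Shuffle(value, Vector128.Create(
            (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));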

@a74nh
Contributor

a74nh commented May 20, 2022

We are looking at this...

With a recent HEAD, I took the Ssse3 C# code and (mostly) updated it to use Vector128.
On Arm64, most things compiled down to NEON code ... but Vector128.Shuffle() generated a function call:

Annotated IR dump of part of the function:

Store str
IN0024: 0000B0                    str     q0, [fp,#144]	// [V17 loc10]

Load str
IN0025: 0000B4                    ldr     q0, [fp,#144]	// [V17 loc10]

Vector128.ShiftRightLogical(str.AsInt32(), 4).AsSByte(),
IN0026: 0000B8                    ushr    v0.4s, v0.4s, #4

Load mask2F
IN0027: 0000BC                    ldr     q1, [fp,#224]	// [V11 loc4]

Vector128<sbyte> hiNibbles = Vector128.BitwiseAnd
IN0028: 0000C0                    and     v0.16b, v0.16b, v1.16b

Store back to hiNibbles
IN0029: 0000C4                    str     q0, [fp,#128]	// [V18 loc11]

Load str (again!)
IN002a: 0000C8                    ldr     q0, [fp,#144]	// [V17 loc10]

Load mask2F (again!)
IN002b: 0000CC                    ldr     q1, [fp,#224]	// [V11 loc4]

Vector128<sbyte> loNibbles = Vector128.BitwiseAnd(str, mask2F);
IN002c: 0000D0                    and     v0.16b, v0.16b, v1.16b

Store loNibbles
IN002d: 0000D4                    str     q0, [fp,#112]	// [V19 loc12]

Load lutHi
IN002e: 0000D8                    ldr     q0, [fp,#288]	// [V07 loc0]

Load hiNibbles
IN002f: 0000DC                    ldr     q1, [fp,#128]	// [V18 loc11]

Function call to shuffle
IN0030: 0000E0                    movz    x0, #0xb7c8
IN0031: 0000E4                    movk    x0, #0x7e74 LSL #16
IN0032: 0000E8                    movk    x0, #0xffff LSL #32
IN0033: 0000EC                    ldr     x0, [x0]
IN0034: 0000F0                    blr     x0
IN0035: 0000F4                    str     q0, [fp,#96]	// [V20 loc13]

Load lutLo
IN0036: 0000F8                    ldr     q0, [fp,#272]	// [V08 loc1]

Load loNibbles
IN0037: 0000FC                    ldr     q1, [fp,#112]	// [V19 loc12]

Function call to shuffle
IN0038: 000100                    movz    x0, #0xb7c8
IN0039: 000104                    movk    x0, #0x7e74 LSL #16
IN003a: 000108                    movk    x0, #0xffff LSL #32
IN003b: 00010C                    ldr     x0, [x0]
IN003c: 000110                    blr     x0
IN003d: 000114                    str     q0, [fp,#80]	// [V21 loc14]

@a74nh
Contributor

a74nh commented May 20, 2022

The above dump was using
public static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)

@tannergooding
Member

On Arm64, most things compiled down to NEON code ... but Vector128.Shuffle() generated a function call

Can you share the managed code that was being used here, preferably including the Shuffle call and how its arguments are being passed in?

@a74nh
Contributor

a74nh commented May 20, 2022

This code was made by taking Ssse3Decode(), replacing Sse2 with Vector128, then fixing up the function calls. It's quite possible there's an obvious mistake on my end.

        // The JIT won't hoist these "constants", so help it
        Vector128<sbyte> lutHi = Vector128.Create(
            0x10, 0x10, 0x01, 0x02,
            0x04, 0x08, 0x04, 0x08,
            0x10, 0x10, 0x10, 0x10,
            0x10, 0x10, 0x10, 0x10);

        Vector128<sbyte> lutLo = Vector128.Create(
            0x15, 0x11, 0x11, 0x11,
            0x11, 0x11, 0x11, 0x11,
            0x11, 0x11, 0x13, 0x1A,
            0x1B, 0x1B, 0x1B, 0x1A);

        Vector128<sbyte> lutShift = Vector128.Create(
            0, 16, 19, 4,
            -65, -65, -71, -71,
            0, 0, 0, 0,
            0, 0, 0, 0);

        Vector128<sbyte> packBytesMask = Vector128.Create(
            2, 1, 0, 6,
            5, 4, 10, 9,
            8, 14, 13, 12,
            -1, -1, -1, -1);

        Vector128<sbyte> mask2F = Vector128.Create((sbyte)'/');
        Vector128<sbyte> mergeConstant0 = Vector128.Create(0x01400140).AsSByte();
        Vector128<short> mergeConstant1 = Vector128.Create(0x00011000).AsInt16();
        Vector128<sbyte> zero = Vector128<sbyte>.Zero;

        byte* src =  srcBytes;
        byte* dest = destBytes;



            Vector128<sbyte> str = Vector128.LoadUnsafe(ref *src).AsSByte();

            // lookup
            Vector128<sbyte> hiNibbles = Vector128.BitwiseAnd(Vector128.ShiftRightLogical(str.AsInt32(), 4).AsSByte(), mask2F);
            Vector128<sbyte> loNibbles = Vector128.BitwiseAnd(str, mask2F);
            Vector128<sbyte> hi = Vector128.Shuffle(lutHi, hiNibbles);
            Vector128<sbyte> lo = Vector128.Shuffle(lutLo, loNibbles);

@a74nh
Contributor

a74nh commented May 23, 2022

Although the lack of an Arm64 version of shuffle isn't surprising, given there is no obvious single-instruction match.

@a74nh
Contributor

a74nh commented May 23, 2022

From the top of Base64Decoder.cs:

// AVX2 version based on https://github.com/aklomp/base64/tree/e516d769a2a432c08404f1981e73b431566057be/lib/arch/avx2
// SSSE3 version based on https://github.com/aklomp/base64/tree/e516d769a2a432c08404f1981e73b431566057be/lib/arch/ssse3

Plan is to base the Arm64 version on the Neon version from the same repo:
https://github.com/aklomp/base64/blob/e516d769a2a432c08404f1981e73b431566057be/lib/arch/neon64/dec_loop.c

Due to this being based on table lookups instead of shuffles, there is likely to be no sharing between the x86 and Arm64 versions.

@kunalspathak
Member

there is likely to be no sharing between the X86 and Arm64 version.

And that should be fine.

@a74nh
Contributor

a74nh commented May 24, 2022

In order to reuse the aklomp algorithm, some API functions are still missing:

These all require sequential registers, which AIUI is quite a big task to implement.

Can we still do the routine without them?

  • The tbl4q instructions can be broken down into 4 tbx instructions plus some extra indexing (see the sketch after this list).
  • LD4 and ST3 could be replaced by multiple LD and ST instructions, but that would then require manual interleaving. I think that can be done using zip/uzp.
  • Or, it may be possible to remove the interleaving requirement from the algorithm.
  • These changes would make the function non-optimal, but the result should still be faster than the sequential version.
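
A sketch of the first bullet, using the existing single-register AdvSimd.Arm64.VectorTableLookup/VectorTableLookupExtension APIs (one TBL plus three TBX; the helper name is illustrative):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;

    // Emulates vqtbl4q_u8 (four-table TBL) with single-register lookups.
    // TBX leaves the accumulator untouched for out-of-range indices, so
    // rebasing the indices by 16 each step merges the four partial lookups.
    static Vector128<byte> TableLookup4(
        Vector128<byte> t0, Vector128<byte> t1,
        Vector128<byte> t2, Vector128<byte> t3,
        Vector128<byte> idx)
    {
        Vector128<byte> sixteen = Vector128.Create((byte)16);
        Vector128<byte> r = AdvSimd.Arm64.VectorTableLookup(t0, idx);  // TBL
        idx = AdvSimd.Subtract(idx, sixteen);                          // 16..31 -> 0..15
        r = AdvSimd.Arm64.VectorTableLookupExtension(r, t1, idx);      // TBX
        idx = AdvSimd.Subtract(idx, sixteen);
        r = AdvSimd.Arm64.VectorTableLookupExtension(r, t2, idx);
        idx = AdvSimd.Subtract(idx, sixteen);
        r = AdvSimd.Arm64.VectorTableLookupExtension(r, t3, idx);
        return r;
    }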

@kunalspathak
Member

@a74nh - Just FYI, we have hardware intrinsics APIs exposed on the System.Runtime.Intrinsics.Arm.AdvSimd class, in case we can use some of them to optimize until we add the missing APIs.

https://source.dot.net/#System.Private.CoreLib/AdvSimd.PlatformNotSupported.cs,f54d7472d19d4e2b

@EgorBo
Member

EgorBo commented May 24, 2022

TableLookup - vqtbl4q_u8 - TBL
TableLookupExtension - vqtbx4q_u8 - TBX

Can the byte versions be used instead here? AdvSimd.VectorTableLookup and AdvSimd.Arm64.VectorTableLookup

I see the x64 code uses byte indices.

@a74nh
Contributor

a74nh commented May 25, 2022

Can the byte versions be used instead here? AdvSimd.VectorTableLookup and AdvSimd.Arm64.VectorTableLookup

I see the x64 code uses byte indices.

Those are the single-register versions, which we can use, but we'll need four of them. So, doable but not optimal.

@a74nh
Contributor

a74nh commented Jun 7, 2022

An implementation of DecodeFromUtf8 without using LD4/ST3/TBX4 is here: #70336

Currently getting a 3x slowdown using it. If we can't make it faster, then it's probably best to wait for LD4/ST3/TBX4.

@kunalspathak
Member

Closed by #70654

Hardware Intrinsics automation moved this from To do arm64 to Done Jun 20, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 20, 2022