
Optimize System.Buffers for arm64 using cross-platform intrinsics #35033

Closed
1 of 2 tasks
Tracked by #64598 ...
BruceForstall opened this issue Apr 16, 2020 · 35 comments
Labels
arch-arm64 area-System.Buffers Cost:M Work that requires one engineer up to 2 weeks Priority:2 Work that is important, but not critical for the release
Milestone

Comments

@BruceForstall
Member

BruceForstall commented Apr 16, 2020

This item tracks the conversion of the System.Buffers code to use arm64 intrinsics (a usage sketch of the two entry points follows the list).

  • System.Buffers.Text.Base64.DecodeFromUtf8()
  • System.Buffers.Text.Base64.EncodeToUtf8()
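
For reference, both entry points follow the same OperationStatus pattern. A minimal usage sketch (buffer handling simplified; variable names are illustrative):

    using System;
    using System.Buffers;
    using System.Buffers.Text;

    byte[] data = { 1, 2, 3, 4, 5 };

    // Encode: size the destination with the helper, then check the status.
    byte[] utf8 = new byte[Base64.GetMaxEncodedToUtf8Length(data.Length)];
    OperationStatus enc = Base64.EncodeToUtf8(data, utf8, out int encConsumed, out int encWritten);

    // Decode the bytes we just produced.
    byte[] decoded = new byte[Base64.GetMaxDecodedFromUtf8Length(encWritten)];
    OperationStatus dec = Base64.DecodeFromUtf8(utf8.AsSpan(0, encWritten), decoded,
                                                out int decConsumed, out int decWritten);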

Update for .NET 7
Now that we have cross-platform intrinsics APIs (#49397), these optimizations should be completed using those APIs instead of adding ARM64-specific intrinsics code paths. The effort could optionally include first measuring performance of these methods with the ARM64-specific intrinsics in place, and then measuring the performance of these methods with the cross-platform intrinsics.

Example use of the new cross-platform intrinsics: #63722
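
For illustration, a minimal sketch of that style, where the same Vector128 code JITs to AdvSimd instructions on arm64 and SSE instructions on x64 with no per-architecture branches (the helper name is made up):

    using System.Runtime.Intrinsics;

    static Vector128<byte> MaskLowNibbles(Vector128<byte> value) =>
        // Splat 0x0F and AND; no Sse2.IsSupported / AdvSimd.IsSupported checks needed.
        Vector128.BitwiseAnd(value, Vector128.Create((byte)0x0F));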

Related: #33308

@BruceForstall BruceForstall added this to the 5.0 milestone Apr 16, 2020
@BruceForstall BruceForstall added this to To do general in Hardware Intrinsics via automation Apr 16, 2020
@ghost

ghost commented Apr 16, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 16, 2020
@BruceForstall BruceForstall moved this from To do general to To do arm64 in Hardware Intrinsics Apr 16, 2020
@john-h-k
Contributor

I can give this a shot

@john-h-k
Contributor

john-h-k commented Apr 24, 2020

These ARM64 intrinsics are missing and are needed to make the code "worthwhile":

Tracked by #32512: Done!

UnsignedSaturatingSubtract - vqsubq_u8 - UQSUB

Tracked by #33398:

UnsignedShiftRight - vshrq_n_u8 - USHR
UnsignedShiftLeft - vshlq_n_u8 - USHL

Tracked by #1277: Done!

TableLookup - vqtbl4q_u8 - TBL
TableLookupExtension - vqtbx4q_u8 - TBX

These helper methods could also be accelerated to help; they are tracked in #33496:

Vector128.Create(byte)

These intrinsics could be introduced to maximise perf (a scalar model of the LD4 semantics follows the list):

Load 3 Vector128<T> - vld3q_u8 - LD3
Store 3 Vector128<T> - vst3q_u8 - ST3
Load 4 Vector128<T> - vld4q_u8 - LD4
Store 4 Vector128<T> - vst4q_u8 - ST4
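
For context, vld4q_u8/LD4 is a structured load that de-interleaves 64 bytes into four vectors (element i of vector n comes from src[4 * i + n]), and vst4q_u8/ST4 is the inverse re-interleaving store. A scalar model of the LD4 semantics (the helper is hypothetical):

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical model of vld4q_u8: de-interleave 64 source bytes
    // into four Vector128<byte> values.
    static (Vector128<byte>, Vector128<byte>, Vector128<byte>, Vector128<byte>)
        Load4Deinterleave(ReadOnlySpan<byte> src)
    {
        byte[] a = new byte[16], b = new byte[16], c = new byte[16], d = new byte[16];
        for (int i = 0; i < 16; i++)
        {
            a[i] = src[4 * i + 0];
            b[i] = src[4 * i + 1];
            c[i] = src[4 * i + 2];
            d[i] = src[4 * i + 3];
        }
        return (Vector128.Create(a), Vector128.Create(b),
                Vector128.Create(c), Vector128.Create(d));
    }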

@john-h-k
Contributor

john-h-k@adeea36

Draft commit; this is most of what can be done until the necessary intrinsics are added.

@tannergooding
Member

Thanks @john-h-k, as per our conversation on Discord, I'll ping you once the necessary intrinsics are added so you can make further progress.

@BruceForstall
Member Author

@tannergooding @john-h-k The shift intrinsics went in with #36830. There have been many improvements to System.Runtime.Intrinsics optimization in #33496. Is this work now unblocked?

@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Jun 3, 2020
@tannergooding
Member

Yes, this should be unblocked now.

@BruceForstall
Member Author

@john-h-k Are you planning to try and finish this for .NET 5, now that it's unblocked?

@tannergooding
Member

If not, we'll have 3-4 people free to help tackle the remaining issues from #33308 on Monday.

CC. @jeffhandley

@john-h-k
Contributor

Yes, I should be. I just need to benchmark the two possible implementations I've got locally, either on my old Pi 2B or by getting a friend to run them on a newer device.

@BruceForstall
Member Author

@tannergooding @jeffhandley What's the plan for this for .NET 5?

@jeffhandley
Member

@tannergooding is planning to carry this one over the finish line ahead of RC1.

@jeffhandley jeffhandley modified the milestones: 5.0.0, 6.0.0 Aug 24, 2020
@jeffhandley
Member

We could not get this completed in time for 5.0.0; moved to 6.0.0.

@jeffhandley jeffhandley added the Priority:3 Work that is nice to have label Jan 14, 2021
@tannergooding tannergooding added the Cost:M Work that requires one engineer up to 2 weeks label Jan 15, 2021
@jeffhandley jeffhandley modified the milestones: 6.0.0, Future Apr 14, 2021
@jeffhandley jeffhandley changed the title Optimize System.Buffers using arm64 intrinsics Optimize System.Buffers for arm64 using cross-platform intrinsics Feb 23, 2022
@kunalspathak
Member

kunalspathak commented Mar 28, 2022

@a74nh - I am assigning this to you. Feel free to reassign to someone who will work on this.

Edit: For some reason, I am not able to assign this.

@gfoidl
Member

gfoidl commented Mar 28, 2022

Base64 relies heavily on shuffling, so shuffle intrinsics are a prerequisite for this issue.
Is shuffling an xplat intrinsic now? At least in .NET 7 Preview 2 I can't find it.

@tannergooding
Member

Is shuffling an xplat intrinsic now? At least in .NET 7 Preview 2 I can't find it.

I'm actively working on the support locally. Hoping to get it done soon but needing to balance that work against the Generic Math work.

@kunalspathak
Member

I'm actively working on the support locally.

#63331

@a74nh
Contributor

a74nh commented Mar 29, 2022

@a74nh - I am assigning this to you. Feel free to reassign to someone who will work on this.

Edit: For some reason, I am not able to assign this.

I can't assign this either. It should go to @SwapnilGaikwad instead of me.

@EgorBo
Member

EgorBo commented Mar 30, 2022

Base64 relies heavily on shuffling, so shuffle intrinsics are a prerequisite for this issue. Is shuffling an xplat intrinsic now? At least in .NET 7 Preview 2 I can't find it.

In #67192, for hex encoding, I introduced a temporary helper until #63331 lands 🙂

@gfoidl
Member

gfoidl commented Mar 30, 2022

IMO it's better to wait for #63331.
Otherwise the code has to be touched again to revert the helper change.

I'm also going to update https://github.com/gfoidl/Base64; there, fuzz tests are run on a scheduled CI trigger, so any fault should get caught early enough for .NET 7 RTM.

@kunalspathak
Member

Looks like Shuffle has landed in #68559, so this can get started? Note that, as called out in #67339, a lot of Base64Encode benchmarks like System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000) are up to 16 times slower. So we should fix this soon to make sure it lands in .NET 7.

@a74nh , @SwapnilGaikwad

@tannergooding
Member

Noting that only the Shuffle(x, mask) API has landed and it still has a few quirks, but it should be generally usable, especially with constant inputs.

The Shuffle(x, y, mask) API is going to need #68874 (which needs a couple of bits of PR feedback handled and a final review).
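
As a concrete example of the constant-mask form that has landed, a minimal sketch (the element-reversing mask is arbitrary, chosen only to show the shape):

    using System.Runtime.Intrinsics;

    // With a constant indices vector, Vector128.Shuffle can be emitted as a
    // single shuffle/table-lookup instruction on the target architecture.
    static Vector128<byte> ReverseElements(Vector128<byte> value) =>
        Vector128.Shuffle(value, Vector128.Create(
            (byte)15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));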

@a74nh
Contributor

a74nh commented May 20, 2022

We are looking at this...

With a recent HEAD, I took the Ssse3 C# code and (mostly) updated it to use Vector128.
On Arm64, most things compiled down to NEON code ... but Vector128.Shuffle() generated a function call:

Annotated IR dump of part of the function:

Store str
IN0024: 0000B0                    str     q0, [fp,#144]	// [V17 loc10]

Load str
IN0025: 0000B4                    ldr     q0, [fp,#144]	// [V17 loc10]

Vector128.ShiftRightLogical(str.AsInt32(), 4).AsSByte(),
IN0026: 0000B8                    ushr    v0.4s, v0.4s, #4

Load mask2F
IN0027: 0000BC                    ldr     q1, [fp,#224]	// [V11 loc4]

Vector128<sbyte> hiNibbles = Vector128.BitwiseAnd
IN0028: 0000C0                    and     v0.16b, v0.16b, v1.16b

Store back to hiNibbles
IN0029: 0000C4                    str     q0, [fp,#128]	// [V18 loc11]

Load str (again!)
IN002a: 0000C8                    ldr     q0, [fp,#144]	// [V17 loc10]

Load mask2F (again!)
IN002b: 0000CC                    ldr     q1, [fp,#224]	// [V11 loc4]

Vector128<sbyte> loNibbles = Vector128.BitwiseAnd(str, mask2F);
IN002c: 0000D0                    and     v0.16b, v0.16b, v1.16b

Store loNibbles
IN002d: 0000D4                    str     q0, [fp,#112]	// [V19 loc12]

Load lutHi
IN002e: 0000D8                    ldr     q0, [fp,#288]	// [V07 loc0]

Load hiNibbles
IN002f: 0000DC                    ldr     q1, [fp,#128]	// [V18 loc11]

Function call to shuffle
IN0030: 0000E0                    movz    x0, #0xb7c8
IN0031: 0000E4                    movk    x0, #0x7e74 LSL #16
IN0032: 0000E8                    movk    x0, #0xffff LSL #32
IN0033: 0000EC                    ldr     x0, [x0]
IN0034: 0000F0                    blr     x0
IN0035: 0000F4                    str     q0, [fp,#96]	// [V20 loc13]

Load lutLo
IN0036: 0000F8                    ldr     q0, [fp,#272]	// [V08 loc1]

Load loNibbles
IN0037: 0000FC                    ldr     q1, [fp,#112]	// [V19 loc12]

Function call to shuffle
IN0038: 000100                    movz    x0, #0xb7c8
IN0039: 000104                    movk    x0, #0x7e74 LSL #16
IN003a: 000108                    movk    x0, #0xffff LSL #32
IN003b: 00010C                    ldr     x0, [x0]
IN003c: 000110                    blr     x0
IN003d: 000114                    str     q0, [fp,#80]	// [V21 loc14]

@a74nh
Contributor

a74nh commented May 20, 2022

The above dump was using
public static Vector128<sbyte> Shuffle(Vector128<sbyte> vector, Vector128<sbyte> indices)

@tannergooding
Member

On Arm64, most things compiled down to NEON code ... but Vector128.Shuffle() generated a function call

Can you share the managed code that was being used here, preferably including the Shuffle call and how its arguments are being passed in?

@a74nh
Contributor

a74nh commented May 20, 2022

This code was made by taking Ssse3Decode(), replacing Sse2 with Vector128, then fixing up the function calls. It's quite possible there's an obvious mistake on my end.

        // The JIT won't hoist these "constants", so help it
        Vector128<sbyte> lutHi = Vector128.Create(
            0x10, 0x10, 0x01, 0x02,
            0x04, 0x08, 0x04, 0x08,
            0x10, 0x10, 0x10, 0x10,
            0x10, 0x10, 0x10, 0x10);

        Vector128<sbyte> lutLo = Vector128.Create(
            0x15, 0x11, 0x11, 0x11,
            0x11, 0x11, 0x11, 0x11,
            0x11, 0x11, 0x13, 0x1A,
            0x1B, 0x1B, 0x1B, 0x1A);

        Vector128<sbyte> lutShift = Vector128.Create(
            0, 16, 19, 4,
            -65, -65, -71, -71,
            0, 0, 0, 0,
            0, 0, 0, 0);

        Vector128<sbyte> packBytesMask = Vector128.Create(
            2, 1, 0, 6,
            5, 4, 10, 9,
            8, 14, 13, 12,
            -1, -1, -1, -1);

        Vector128<sbyte> mask2F = Vector128.Create((sbyte)'/');
        Vector128<sbyte> mergeConstant0 = Vector128.Create(0x01400140).AsSByte();
        Vector128<short> mergeConstant1 = Vector128.Create(0x00011000).AsInt16();
        Vector128<sbyte> zero = Vector128<sbyte>.Zero;

        byte* src =  srcBytes;
        byte* dest = destBytes;



            Vector128<sbyte> str = Vector128.LoadUnsafe(ref *src).AsSByte();

            // lookup
            Vector128<sbyte> hiNibbles = Vector128.BitwiseAnd(Vector128.ShiftRightLogical(str.AsInt32(), 4).AsSByte(), mask2F);
            Vector128<sbyte> loNibbles = Vector128.BitwiseAnd(str, mask2F);
            Vector128<sbyte> hi = Vector128.Shuffle(lutHi, hiNibbles);
            Vector128<sbyte> lo = Vector128.Shuffle(lutLo, loNibbles);

@a74nh
Contributor

a74nh commented May 23, 2022

Although the lack of an Arm64 version of shuffle isn't surprising, given there is no obvious single-instruction match.

@a74nh
Contributor

a74nh commented May 23, 2022

From the top of Base64Decoder.cs:

// AVX2 version based on https://github.com/aklomp/base64/tree/e516d769a2a432c08404f1981e73b431566057be/lib/arch/avx2
// SSSE3 version based on https://github.com/aklomp/base64/tree/e516d769a2a432c08404f1981e73b431566057be/lib/arch/ssse3

Plan is to base the Arm64 version on the Neon version from the same repo:
https://github.com/aklomp/base64/blob/e516d769a2a432c08404f1981e73b431566057be/lib/arch/neon64/dec_loop.c

Due to this being based on table lookups instead of shuffles, there is likely to be no sharing between the x86 and Arm64 versions.

@kunalspathak
Member

there is likely to be no sharing between the X86 and Arm64 version.

And that should be fine.

@a74nh
Contributor

a74nh commented May 24, 2022

In order to reuse the aklomp algorithm, some API functions are still missing:

These all require sequential registers, which AIUI is quite a big task to implement.

Can we still do the routine without them?

  • The tbl4q instructions can be broken down into 4 tbx instructions plus some extra indexing (see the sketch after this list).
  • LD4 and ST3 could be replaced by multiple LD and ST instructions, but that would then require manual interleaving. I think that can be done using zip/uzp.
  • Or, it may be possible to remove the interleaving requirement from the algorithm.
  • These changes would make the function non-optimal, but the result should still be faster than the sequential version.
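
A sketch of the first bullet, using the existing single-register AdvSimd.Arm64.VectorTableLookup/VectorTableLookupExtension APIs (one TBL plus three TBX; the helper name is illustrative):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;

    // Emulates vqtbl4q_u8 (four-table TBL) with single-register lookups.
    // TBX leaves the accumulator untouched for out-of-range indices, so
    // rebasing the indices by 16 each step merges the four partial lookups.
    static Vector128<byte> TableLookup4(
        Vector128<byte> t0, Vector128<byte> t1,
        Vector128<byte> t2, Vector128<byte> t3,
        Vector128<byte> idx)
    {
        Vector128<byte> sixteen = Vector128.Create((byte)16);
        Vector128<byte> r = AdvSimd.Arm64.VectorTableLookup(t0, idx);  // TBL
        idx = AdvSimd.Subtract(idx, sixteen);                          // 16..31 -> 0..15
        r = AdvSimd.Arm64.VectorTableLookupExtension(r, t1, idx);      // TBX
        idx = AdvSimd.Subtract(idx, sixteen);
        r = AdvSimd.Arm64.VectorTableLookupExtension(r, t2, idx);
        idx = AdvSimd.Subtract(idx, sixteen);
        r = AdvSimd.Arm64.VectorTableLookupExtension(r, t3, idx);
        return r;
    }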

@kunalspathak
Member

@a74nh - Just FYI, we have hardware intrinsics APIs exposed on the System.Runtime.Intrinsics.Arm.AdvSimd class, in case we can use some of them to optimize until we add the missing APIs.

https://source.dot.net/#System.Private.CoreLib/AdvSimd.PlatformNotSupported.cs,f54d7472d19d4e2b

@EgorBo
Member

EgorBo commented May 24, 2022

TableLookup - vqtbl4q_u8 - TBL
TableLookupExtension - vqtbx4q_u8 - TBX

Can the byte versions be used instead here? AdvSimd.VectorTableLookup and AdvSimd.Arm64.VectorTableLookup

I see the x64 code uses byte indices.

@a74nh
Contributor

a74nh commented May 25, 2022

Can the byte versions be used instead here? AdvSimd.VectorTableLookup and AdvSimd.Arm64.VectorTableLookup

I see the x64 code uses byte indices.

Those are the single-register versions, which we can use, but we'll need four of them. So, doable but not optimal.

@a74nh
Contributor

a74nh commented Jun 7, 2022

An implementation of DecodeFromUtf8 without using LD4/ST3/TBX4 is here: #70336

Currently getting a 3x slowdown using it. If we can't make it faster, then it's probably best to wait for LD4/ST3/TBX4.

@kunalspathak
Member

Closed by #70654

Hardware Intrinsics automation moved this from To do arm64 to Done Jun 20, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 20, 2022