Use multi-reg load/store for DecodeFromUTF8 #100589

Merged
merged 3 commits into dotnet:main on Apr 14, 2024

Conversation

SwapnilGaikwad
Contributor

This implements the decode-from-UTF8 algorithm listed here.

@dotnet-issue-labeler bot added the needs-area-label label on Apr 3, 2024
@dotnet-policy-service bot added the community-contribution label on Apr 3, 2024
@MihaZupan added the area-System.Memory label and removed the needs-area-label label on Apr 3, 2024
@SwapnilGaikwad
Contributor Author

On an N1 machine, the patch improves performance:

| Method                          | Toolchain                                                                  | MemoryRandomization | NumberOfBytes | Mean     | Error    | StdDev   | Median   | Min      | Max      | Ratio | MannWhitney(2%) |
|-------------------------------- |--------------------------------------------------------------------------- |-------------------- |-------------- |---------:|---------:|---------:|---------:|---------:|---------:|------:|---------------- |
| Base64Decode                    | /base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | Default             | 1000          | 357.7 ns |  0.18 ns |  0.16 ns | 357.7 ns | 357.3 ns | 357.9 ns |  1.00 | Base            |
| Base64Decode                    | /patch/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | Default             | 1000          | 283.7 ns |  0.03 ns |  0.03 ns | 283.7 ns | 283.7 ns | 283.8 ns |  0.79 | Faster          |
|                                 |                                                                            |                     |               |          |          |          |          |          |          |       |                 |
| Base64DecodeDestinationTooSmall | /base/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun  | True                | 1000          | 369.0 ns | 10.47 ns | 12.06 ns | 368.5 ns | 356.9 ns | 383.1 ns |  1.00 | Base            |
| Base64DecodeDestinationTooSmall | /patch/artifacts/tests/coreclr/linux.arm64.Release/Tests/Core_Root/corerun | True                | 1000          | 286.4 ns |  3.69 ns |  3.46 ns | 283.9 ns | 283.8 ns | 291.5 ns |  0.78 | Faster          |

@SwapnilGaikwad
Contributor Author

@kunalspathak @a74nh

@kunalspathak
Member

@dotnet/jit-contrib @tannergooding


// Invalid values are set to 255 during the look-ups above using 'decLutOne' and 'decLutTwo'.
// Thus the intermediate results 'decOne' and 'decTwo' can be OR-ed to get the final values.
str1 = AdvSimd.Or(decOne1, decTwo1);
Contributor

Question for others: when writing a routine that is purely AdvSimd, for basic functions such as Or, should we use AdvSimd throughout, or mix APIs and use Vector128?

Member

I agree. We should just use decOne1 | decTwo1

Member

In general we want to avoid writing routines that are purely for a given architecture or ISA wherever possible.

It is simply not maintainable to write and support implementations for all of:

  • Arm64
    • AdvSimd
    • Sve/Sve2
  • x86/x64
    • Sse2/Sse4.1
    • Avx/Avx2
    • Avx512/Avx10
  • WASM
  • RISC-V
  • LoongArch64
  • etc

So, in the ideal world we are using xplat code and sharing it across all target platforms for the vast majority of code we write. The only duplication we should have in the common case is for different vector sizes; that is, currently duplicating code across V128/V256/V512 and Vector, based on the sizes we want to support. Even this is something we're looking at allowing more sharing of in the future via ISimdVector<TSelf, T>, so code is truly written once as much as possible.

We then opportunistically light up usage of platform specific behavior only if the additional complexity is justified by the perf gains and ideally only where absolutely required, still using shared xplat code for most of the algorithm.

The xplat APIs should cover most core functionality and the JIT should recognize common patterns and optimize to better platform specific codegen where possible. Where there is some functionality that might be missing, we can propose and expose new APIs that work better across the range of platforms we need to support.
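A hedged sketch of that guidance (the method and all names here are illustrative, not from the PR): a single xplat implementation that JIT-compiles to AdvSimd on Arm64 and SSE2+ on x64, with a scalar tail.

using System.Runtime.Intrinsics;

static bool AnyByteAbove(ReadOnlySpan<byte> data, byte threshold)
{
    Vector128<byte> limit = Vector128.Create(threshold);
    int i = 0;
    if (Vector128.IsHardwareAccelerated)
    {
        // One loop body shared by every platform with 128-bit SIMD support.
        for (; i + Vector128<byte>.Count <= data.Length; i += Vector128<byte>.Count)
        {
            Vector128<byte> v = Vector128.Create(data.Slice(i, Vector128<byte>.Count));
            if (Vector128.GreaterThanAny(v, limit))
                return true;
        }
    }
    // Scalar tail for the remaining bytes (and for non-accelerated targets).
    for (; i < data.Length; i++)
    {
        if (data[i] > threshold)
            return true;
    }
    return false;
}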

Contributor

@a74nh left a comment


Otherwise LGTM

Member

@kunalspathak left a comment


Add a few comments, and could you please share the generated code?

@SwapnilGaikwad
Contributor Author

Add a few comments, and could you please share the generated code?

Assembly for 'AdvSimdDecode', extracted from 'System.Private.CoreLib.dll':
b.cc    0xffffa8327378  // b.lo, b.ul, b.last
mov     x6, x27
str     x28, [x29, #272]
movi    v8.16b, #0x3f           // <==== Line#894: DuplicateToVector128()
str     q8, [x29, #256]
ldr     q9, 0xffffa83278b0
str     q9, [x29, #96]
ldr     q10, 0xffffa83278c0
str     q10, [x29, #80]
ldr     q11, 0xffffa83278d0
str     q11, [x29, #64]
ldr     q12, 0xffffa83278e0
str     q12, [x29, #48]
ldr     q13, 0xffffa83278f0
str     q13, [x29, #32]
ldr     q14, 0xffffa8327900
str     q14, [x29, #16]
str     w4, [x29, #344]
mov     w2, w4
str     x6, [x29, #280]
mov     x0, x6
mov     x1, x27
adrp    x11, 0xffffa8e1c000
add     x11, x11, #0xef8
mov     v8.d[0], v9.d[1]
mov     v11.d[0], v10.d[1]
ldr     x13, [x11]
blr     x13
ldr     x6, [x29, #280]
ld4     {v16.16b-v19.16b}, [x6]  // <==== Line#902: LoadVector128x4AndUnzip()/approx. loop start
stp     q16, q17, [x29, #192]
stp     q18, q19, [x29, #224]
ldp     q16, q17, [x29, #192]
ldp     q18, q19, [x29, #224]
ldr     q20, [x29, #256]
uqsub   v21.16b, v16.16b, v20.16b
uqsub   v22.16b, v17.16b, v20.16b
uqsub   v23.16b, v18.16b, v20.16b
uqsub   v24.16b, v19.16b, v20.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v9.d[1], v8.d[0]
mov     v10.d[1], v11.d[0]
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v16.16b, {v25.16b-v28.16b}, v16.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v17.16b, {v27.16b-v30.16b}, v17.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v9.16b
mov     v28.16b, v10.16b
tbl     v18.16b, {v25.16b-v28.16b}, v18.16b
mvni    v25.4s, #0x0
mvni    v26.4s, #0x0
mov     v27.16b, v25.16b
mov     v28.16b, v26.16b
mov     v29.16b, v9.16b
mov     v30.16b, v10.16b
tbl     v19.16b, {v27.16b-v30.16b}, v19.16b
ldp     q26, q25, [x29, #48]
ldp     q28, q27, [x29, #16]
tbx     v21.16b, {v25.16b-v28.16b}, v21.16b
tbx     v22.16b, {v25.16b-v28.16b}, v22.16b
tbx     v23.16b, {v25.16b-v28.16b}, v23.16b
tbx     v24.16b, {v25.16b-v28.16b}, v24.16b
orr     v16.16b, v16.16b, v21.16b
orr     v17.16b, v17.16b, v22.16b
orr     v18.16b, v18.16b, v23.16b
orr     v19.16b, v19.16b, v24.16b
cmhi    v21.16b, v16.16b, v20.16b
cmhi    v22.16b, v17.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v18.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
cmhi    v22.16b, v19.16b, v20.16b
orr     v21.16b, v21.16b, v22.16b
umaxp   v21.4s, v21.4s, v21.4s
mov     x2, v21.d[0]
cmp     x2, #0x0
b.ne    0xffffa8327350  // b.any
shl     v16.16b, v16.16b, #2
ushr    v21.16b, v17.16b, #4
orr     v8.16b, v16.16b, v21.16b
shl     v16.16b, v17.16b, #4
ushr    v17.16b, v18.16b, #2
orr     v11.16b, v16.16b, v17.16b
shl     v16.16b, v18.16b, #6
orr     v12.16b, v16.16b, v19.16b
mov     w2, w19
ldr     x0, [x29, #272]
mov     x1, x28
adrp    x11, 0xffffa8e1c000
add     x11, x11, #0xf00
mov     v14.d[0], v8.d[1]
mov     v9.d[0], v11.d[1]
mov     v10.d[0], v12.d[1]
ldr     x3, [x11]
blr     x3
mov     v8.d[1], v14.d[0]
mov     v11.d[1], v9.d[0]
mov     v12.d[1], v10.d[0]
ldr     x7, [x29, #272]
mov     v16.16b, v8.16b
mov     v17.16b, v11.16b
mov     v18.16b, v12.16b
st3     {v16.16b-v18.16b}, [x7]
ldr     x6, [x29, #280]
add     x6, x6, #0x40
add     x7, x7, #0x30
ldr     x3, [x29, #288]
cmp     x6, x3
str     x7, [x29, #272]
str     x3, [x29, #288]
ldp     q10, q9, [x29, #80]
b.ls    0xffffa832754c  // b.plast
str     x6, [x29, #280]         //  <==== End of loop
ldr     x6, [x29, #280]
mov     x4, x6
ldr     x7, [x29, #272]
mov     x5, x7
ldr     x6, [x29, #312]
cmp     x4, x6
b.eq    0xffffa8327740  // b.none

@SwapnilGaikwad
Contributor Author

Hi @kunalspathak, there is no change in assembly after the refactoring.

New assembly sequence
b.cc	0xffffa8337378  // b.lo, b.ul, b.last
mov	x6, x27
str	x28, [x29, #272]
movi	v8.16b, #0x3f
str	q8, [x29, #256]
ldr	q9, 0xffffa83378b0
str	q9, [x29, #96]
ldr	q10, 0xffffa83378c0
str	q10, [x29, #80]
ldr	q11, 0xffffa83378d0
str	q11, [x29, #64]
ldr	q12, 0xffffa83378e0
str	q12, [x29, #48]
ldr	q13, 0xffffa83378f0
str	q13, [x29, #32]
ldr	q14, 0xffffa8337900
str	q14, [x29, #16]
str	w4, [x29, #344]
mov	w2, w4
str	x6, [x29, #280]
mov	x0, x6
mov	x1, x27
adrp	x11, 0xffffa8e2c000
add	x11, x11, #0xef8
mov	v8.d[0], v9.d[1]
mov	v11.d[0], v10.d[1]
ldr	x13, [x11]
blr	x13
ldr	x6, [x29, #280]
ld4	{v16.16b-v19.16b}, [x6]
stp	q16, q17, [x29, #192]
stp	q18, q19, [x29, #224]
ldp	q16, q17, [x29, #192]
ldp	q18, q19, [x29, #224]
ldr	q20, [x29, #256]
uqsub	v21.16b, v16.16b, v20.16b
uqsub	v22.16b, v17.16b, v20.16b
uqsub	v23.16b, v18.16b, v20.16b
uqsub	v24.16b, v19.16b, v20.16b
mvni	v25.4s, #0x0
mvni	v26.4s, #0x0
mov	v9.d[1], v8.d[0]
mov	v10.d[1], v11.d[0]
mov	v27.16b, v9.16b
mov	v28.16b, v10.16b
tbl	v16.16b, {v25.16b-v28.16b}, v16.16b
mvni	v25.4s, #0x0
mvni	v26.4s, #0x0
mov	v27.16b, v25.16b
mov	v28.16b, v26.16b
mov	v29.16b, v9.16b
mov	v30.16b, v10.16b
tbl	v17.16b, {v27.16b-v30.16b}, v17.16b
mvni	v25.4s, #0x0
mvni	v26.4s, #0x0
mov	v27.16b, v9.16b
mov	v28.16b, v10.16b
tbl	v18.16b, {v25.16b-v28.16b}, v18.16b
mvni	v25.4s, #0x0
mvni	v26.4s, #0x0
mov	v27.16b, v25.16b
mov	v28.16b, v26.16b
mov	v29.16b, v9.16b
mov	v30.16b, v10.16b
tbl	v19.16b, {v27.16b-v30.16b}, v19.16b
ldp	q26, q25, [x29, #48]
ldp	q28, q27, [x29, #16]
tbx	v21.16b, {v25.16b-v28.16b}, v21.16b
tbx	v22.16b, {v25.16b-v28.16b}, v22.16b
tbx	v23.16b, {v25.16b-v28.16b}, v23.16b
tbx	v24.16b, {v25.16b-v28.16b}, v24.16b
orr	v16.16b, v16.16b, v21.16b
orr	v17.16b, v17.16b, v22.16b
orr	v18.16b, v18.16b, v23.16b
orr	v19.16b, v19.16b, v24.16b
cmhi	v21.16b, v16.16b, v20.16b
cmhi	v22.16b, v17.16b, v20.16b
orr	v21.16b, v21.16b, v22.16b
cmhi	v22.16b, v18.16b, v20.16b
orr	v21.16b, v21.16b, v22.16b
cmhi	v22.16b, v19.16b, v20.16b
orr	v21.16b, v21.16b, v22.16b
umaxp	v21.4s, v21.4s, v21.4s
mov	x2, v21.d[0]
cmp	x2, #0x0
b.ne	0xffffa8337350  // b.any
shl	v16.16b, v16.16b, #2
ushr	v21.16b, v17.16b, #4
orr	v8.16b, v16.16b, v21.16b
shl	v16.16b, v17.16b, #4
ushr	v17.16b, v18.16b, #2
orr	v11.16b, v16.16b, v17.16b
shl	v16.16b, v18.16b, #6
orr	v12.16b, v16.16b, v19.16b
mov	w2, w19
ldr	x0, [x29, #272]
mov	x1, x28
adrp	x11, 0xffffa8e2c000
add	x11, x11, #0xf00
mov	v14.d[0], v8.d[1]
mov	v9.d[0], v11.d[1]
mov	v10.d[0], v12.d[1]
ldr	x3, [x11]
blr	x3
mov	v8.d[1], v14.d[0]
mov	v11.d[1], v9.d[0]
mov	v12.d[1], v10.d[0]
ldr	x7, [x29, #272]
mov	v16.16b, v8.16b
mov	v17.16b, v11.16b
mov	v18.16b, v12.16b
st3	{v16.16b-v18.16b}, [x7]
ldr	x6, [x29, #280]
add	x6, x6, #0x40
add	x7, x7, #0x30
ldr	x3, [x29, #288]
cmp	x6, x3
str	x7, [x29, #272]
str	x3, [x29, #288]
ldp	q10, q9, [x29, #80]
b.ls	0xffffa833754c  // b.plast
str	x6, [x29, #280]
ldr	x6, [x29, #280]
mov	x4, x6
ldr	x7, [x29, #272]
mov	x5, x7
ldr	x6, [x29, #312]
cmp	x4, x6
b.eq	0xffffa8337740

Member

@kunalspathak left a comment


LGTM

@kunalspathak merged commit fa1164c into dotnet:main on Apr 14, 2024
146 of 150 checks passed
Comment on lines +867 to +868
Vector128<byte> decLutOne1 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutOne2 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Member

These could have been Vector128<byte>.AllBitsSet
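A minimal sketch of the suggested replacement (same local names as the PR):

// All-ones vectors; typically no 16-byte constant needs to be loaded from memory.
Vector128<byte> decLutOne1 = Vector128<byte>.AllBitsSet;
Vector128<byte> decLutOne2 = Vector128<byte>.AllBitsSet;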

Comment on lines +867 to +874
Vector128<byte> decLutOne1 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutOne2 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutOne3 = Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte();
Vector128<byte> decLutOne4 = Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutTwo1 = Vector128.Create(0x0100FF00, 0x05040302, 0x09080706, 0x0D0C0B0A).AsByte();
Vector128<byte> decLutTwo2 = Vector128.Create(0x11100F0E, 0x15141312, 0x19181716, 0xFFFFFFFF).AsByte();
Vector128<byte> decLutTwo3 = Vector128.Create(0x1B1AFFFF, 0x1F1E1D1C, 0x23222120, 0x27262524).AsByte();
Vector128<byte> decLutTwo4 = Vector128.Create(0x2B2A2928, 0x2F2E2D2C, 0x33323130, 0xFFFFFFFF).AsByte();
Member

It's not clear why these all need to be declared up front. It effectively defines unnecessary locals and creates extra work for the JIT.

This could have been var decLutOne = (Vector128<byte>.AllBitsSet, Vector128<byte>.AllBitsSet, Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte(), Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte()) instead, as an example.
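The same suggestion as a code block (the expression is the reviewer's, reflowed for readability):

var decLutOne = (Vector128<byte>.AllBitsSet,
                 Vector128<byte>.AllBitsSet,
                 Vector128.Create(0xFFFFFFFF, 0xFFFFFFFF, 0x3EFFFFFF, 0x3FFFFFFF).AsByte(),
                 Vector128.Create(0x37363534, 0x3B3A3938, 0xFFFF3D3C, 0xFFFFFFFF).AsByte());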

Comment on lines +876 to +890
Vector128<byte> decOne1;
Vector128<byte> decOne2;
Vector128<byte> decOne3;
Vector128<byte> decOne4;
Vector128<byte> decTwo1;
Vector128<byte> decTwo2;
Vector128<byte> decTwo3;
Vector128<byte> decTwo4;
Vector128<byte> str1;
Vector128<byte> str2;
Vector128<byte> str3;
Vector128<byte> str4;
Vector128<byte> res1;
Vector128<byte> res2;
Vector128<byte> res3;
Member

Similar here: it's not clear why these were defined up front like this. They could have been declared inside the loop at their first use (since they aren't used outside the loop), making the overall code much more readable without needing 15 up-front declarations.
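A hedged sketch of that shape, reusing the PR's names (the rest of the loop body and the loop condition are elided/illustrative):

do
{
    // Declare at first use; none of these are live across iterations.
    (Vector128<byte> str1, Vector128<byte> str2, Vector128<byte> str3, Vector128<byte> str4) =
        AdvSimd.Arm64.LoadVector128x4AndUnzip(src);

    Vector128<byte> decOne1 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str1);
    // ... remaining decode steps as in the PR ...
}
while (src <= srcEnd);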


byte* src = srcBytes;
byte* dest = destBytes;
Vector128<byte> offset = AdvSimd.DuplicateToVector128((byte)0x3F);
Member

Vector128.Create<byte>(0x3F) is preferred over AdvSimd.DuplicateToVector128.

The former allows more general optimizations to kick in as it does not prescribe a particular instruction be emitted.
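A one-line sketch of the preferred form (same local name as the PR):

// The JIT may still emit movi v.16b, #0x3f here, but is free to choose otherwise.
Vector128<byte> offset = Vector128.Create<byte>(0x3F);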

do
{
// Load 64 bytes and de-interleave.
AssertRead<Vector128<byte>>(src, srcStart, sourceLength);
Member

I know this predates the PR, but this is not a great method name. It should probably be something like AssertReadIsValid or similar.

{
// Load 64 bytes and de-interleave.
AssertRead<Vector128<byte>>(src, srcStart, sourceLength);
(str1, str2, str3, str4) = AdvSimd.Arm64.LoadVector128x4AndUnzip(src);
Member

It would be nice if LoadVector128x4AndUnzip was something we could pattern match and implicitly light up.

Contributor Author

Did you mean a wrapper method, say LoadVectorAndUnzip(), that would do the right thing based on the return type?

Member

No; rather, as per my comment above (#100589 (comment)), we prefer to write xplat code so logic can be shared across CPU architectures where possible.

To facilitate this, we ideally have the JIT recognize simple patterns and optimize them. LoadVector128x4, for example, could be done automatically if the JIT were to recognize 4x sequential Vector128.Load.

LoadVector128x4AndUnzip could theoretically have the same done if we were to recognize 4x sequential Vector128.Load plus a general pattern for unzip. Today that would mean recognizing Shuffle in particular, but longer term it might be a good reason for an xplat Vector128.Unzip (since that corresponds directly to Unzip on Arm64 and Unpack on x64).


The driving question tends to be: "does this logic actually have to be platform/ISA specific, or is there a simple pattern we could enable the JIT to recognize instead?" Where simple pattern recognition is possible, it's generally ideal to enable it so that existing user code lights up and gets faster on Arm64 without users having to do the work manually. It also lets the same code automatically light up for things like AdvSimd vs SVE, rather than being strictly tied to AdvSimd.
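A hedged sketch of the load pattern being described (illustrative code, not from the PR): four adjacent 16-byte loads that the JIT could, in principle, fuse into a single multi-register load.

// On Arm64 these four loads could be recognized and fused into one
// ld1 {v0.16b-v3.16b}, i.e. the equivalent of LoadVector128x4.
Vector128<byte> s0 = Vector128.Load(src);
Vector128<byte> s1 = Vector128.Load(src + 16);
Vector128<byte> s2 = Vector128.Load(src + 32);
Vector128<byte> s3 = Vector128.Load(src + 48);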

Contributor Author

Right, thanks for the clarification. I'll keep this in mind for common patterns in the future.

Comment on lines +942 to +944
res1 = AdvSimd.ShiftLeftLogical(str1, 2) | AdvSimd.ShiftRightLogical(str2, 4);
res2 = AdvSimd.ShiftLeftLogical(str2, 4) | AdvSimd.ShiftRightLogical(str3, 2);
res3 = AdvSimd.ShiftLeftLogical(str3, 6) | str4;
Member

These could have been of the form res1 = (str1 << 2) | (str2 >> 4);
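Spelled out for all three results (a sketch; for byte elements the >> operator is a logical shift, matching ShiftRightLogical):

res1 = (str1 << 2) | (str2 >> 4);
res2 = (str2 << 4) | (str3 >> 2);
res3 = (str3 << 6) | str4;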

Comment on lines +930 to +933
Vector128<byte> classified = AdvSimd.CompareGreaterThan(str1, offset)
| AdvSimd.CompareGreaterThan(str2, offset)
| AdvSimd.CompareGreaterThan(str3, offset)
| AdvSimd.CompareGreaterThan(str4, offset);
Member

These could have been the xplat Vector128.GreaterThan(str1, offset)
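As a sketch, the whole classification can stay in xplat form (same names as the PR):

Vector128<byte> classified = Vector128.GreaterThan(str1, offset)
                           | Vector128.GreaterThan(str2, offset)
                           | Vector128.GreaterThan(str3, offset)
                           | Vector128.GreaterThan(str4, offset);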

Contributor Author

I didn't understand the xplat bit. Could you elaborate on this, please?

Member

We have APIs defined on Vector128 that expose functionality common across multiple ISAs (AdvSimd, SVE, etc) and common across multiple architectures (Arm64, x64, WASM, etc).

Using these APIs tends to be preferred over using the platform-specific APIs, particularly when there is a clear 1-to-1 mapping. Thus x + y is preferred over AdvSimd.Add, and Vector128.GreaterThan is preferred over AdvSimd.CompareGreaterThan.

Using the xplat (cross platform) APIs gives the JIT more flexibility in the code it generates and allows more portability, while also often improving readability.
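A short illustrative contrast (x and y are assumed Vector128<byte> locals, not names from the PR):

Vector128<byte> sumArm = AdvSimd.Add(x, y);                  // Arm64-specific intrinsic
Vector128<byte> sum    = x + y;                              // xplat, preferred
Vector128<byte> gtArm  = AdvSimd.CompareGreaterThan(x, y);   // Arm64-specific intrinsic
Vector128<byte> gt     = Vector128.GreaterThan(x, y);        // xplat, preferred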

Contributor Author

Makes sense, I'll update these calls 👍

Comment on lines +904 to +927
// Get indices for second LUT:
decTwo1 = AdvSimd.SubtractSaturate(str1, offset);
decTwo2 = AdvSimd.SubtractSaturate(str2, offset);
decTwo3 = AdvSimd.SubtractSaturate(str3, offset);
decTwo4 = AdvSimd.SubtractSaturate(str4, offset);

// Get values from first LUT. Out-of-range indices are set to 0.
decOne1 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str1);
decOne2 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str2);
decOne3 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str3);
decOne4 = AdvSimd.Arm64.VectorTableLookup(decLutOne, str4);

// Get values from second LUT. Out-of-range indices are unchanged.
decTwo1 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo1, decLutTwo, decTwo1);
decTwo2 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo2, decLutTwo, decTwo2);
decTwo3 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo3, decLutTwo, decTwo3);
decTwo4 = AdvSimd.Arm64.VectorTableLookupExtension(decTwo4, decLutTwo, decTwo4);

// Invalid values are set to 255 during the look-ups above using 'decLutOne' and 'decLutTwo'.
// Thus the intermediate results 'decOne' and 'decTwo' can be OR-ed to get the final values.
str1 = decOne1 | decTwo1;
str2 = decOne2 | decTwo2;
str3 = decOne3 | decTwo3;
str4 = decOne4 | decTwo4;
Member

This general algorithm could probably do with a comment walking through what's happening, so the "magic" is better understood, much like the one that exists for Vector128Decode.
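A hedged sketch of what such a walkthrough comment might cover, inferred from the lookup-table constants above (not text from the PR):

// Each input byte is an ASCII Base64 character; two table lookups decode it.
//
// decLutOne is indexed by the raw byte value. tbl returns 0 for out-of-range
// indices (>= 64), so it only decodes characters below 0x40: '+' -> 62,
// '/' -> 63 and '0'-'9' -> 52-61; every other in-range slot holds 0xFF.
//
// decTwo = SubtractSaturate(byte, 0x3F) rebases characters >= 0x40 into the
// index range of decLutTwo ('A'-'Z' -> values 0-25, 'a'-'z' -> 26-51), while
// characters below 0x40 saturate to index 0, whose slot holds 0x00.
//
// For every valid character exactly one lookup yields the decoded value and
// the other yields 0, so OR-ing the two results merges them. Invalid
// characters end up above 0x3F (0xFF from a table slot, or the unchanged
// saturated index for bytes past the tables, since tbx leaves out-of-range
// indices untouched), which the later greater-than-0x3F comparison flags.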

Contributor Author

Agreed, it's difficult to follow. It's a bit tricky, though, to describe the entire workings of the lookup tables and interleaving. I'll try to make it less opaque.

@SwapnilGaikwad
Contributor Author

Hi @tannergooding, thanks for your comments. I'll add a follow-up PR incorporating as many suggestions as I can.

@SwapnilGaikwad
Contributor Author


#101620 addresses the above suggestions.

matouskozak pushed a commit to matouskozak/runtime referencing this pull request on Apr 30, 2024:
* Use multi-reg load/store for DecodeFromUTF8

* Address review comments
Labels
arch-arm64 · area-System.Memory · community-contribution