
Get index of first non ascii byte #39506

Merged · 14 commits · Aug 13, 2020

Conversation

pgovind
Contributor

@pgovind pgovind commented Jul 17, 2020

Can be reviewed now. I'll add perf numbers in a bit.

Implements GetIndexOfFirstNonAsciiByte from #35034

@jkotas jkotas added the NO-REVIEW Experimental/testing PR, do NOT review it label Jul 17, 2020
@pgovind pgovind force-pushed the GetIndexOfFirstNonAsciiByte branch from 458145e to 613caec on July 29, 2020 21:39
@pgovind pgovind added area-System.Runtime.Intrinsics and removed NO-REVIEW Experimental/testing PR, do NOT review it labels Jul 29, 2020
@ghost

ghost commented Jul 29, 2020

Tagging subscribers to this area: @tannergooding
See info in area-owners.md if you want to be subscribed.

Member

@GrabYourPitchforks GrabYourPitchforks left a comment

Just two nits. Don't let it block you.

@pgovind
Contributor Author

pgovind commented Jul 30, 2020

Updated Perf Results:


total diff: 1

| Faster                                 | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Text.Experimental.Perf.IsAscii  |      1.06 |         14943.78 |         14523.75 |         |

No file given

Member

@kunalspathak kunalspathak left a comment

You are using the incorrect GetNonAsciiBytes(). For your purpose, you need GetIndexOfFirstNonAsciiByte().

Member

@GrabYourPitchforks GrabYourPitchforks left a comment

(undoing previous approval since many recent changes)

@GrabYourPitchforks
Member

GitHub really doesn't make it easy to remove that green check mark. :(

@pgovind pgovind force-pushed the GetIndexOfFirstNonAsciiByte branch from 24eacf8 to 5d33b49 on July 31, 2020 01:10
Prashanth Govindarajan added 2 commits July 30, 2020 19:35
Member

@lewing lewing left a comment

Visibility seems wrong for something that is consumed from private methods?

@pgovind pgovind force-pushed the GetIndexOfFirstNonAsciiByte branch from a2ea59c to 8f259e6 on July 31, 2020 22:10
@pgovind pgovind force-pushed the GetIndexOfFirstNonAsciiByte branch from 9cbe471 to 0152f82 on August 6, 2020 23:19
Debug.Assert(Sse2.IsSupported || AdvSimd.Arm64.IsSupported, "Sse2 or AdvSimd64 required.");
Debug.Assert(BitConverter.IsLittleEndian, "This SSE2/Arm64 implementation assumes little-endian.");

Vector128<byte> bitmask = BitConverter.IsLittleEndian ?
Member

I'm assuming the bitmask is intentionally defined as a local?

Member

You throw PlatformNotSupportedException() inside GetIndexOfFirstNonAsciiByteInLane() if !BitConverter.IsLittleEndian. Why are you creating the bitMask for that case to begin with?

Contributor Author

@pgovind pgovind Aug 10, 2020

I'm assuming the bitmask is intentionally defined as a local?

Yup. Declaring it static won't work because of static-initialization issues in either .NET Standard or .NET Framework (I don't remember which; I just remember @carlossanlop's conclusion).

Why are you creating the bitMask for that case to begin with?

Yup, the idea is that this part will just work if we start supporting BigEndian environments
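
For context, the pattern being discussed looks roughly like the sketch below. The constants are illustrative placeholders, not the exact values used in ASCIIUtility; only the local-vs-static and little-endian/big-endian selection pattern matters here.

```csharp
using System;
using System.Runtime.Intrinsics;

static class BitmaskSketch
{
    // Hypothetical helper mirroring the pattern under discussion: the bitmask is a
    // local (not a static field) to avoid static-initialization issues on older
    // targets, and the big-endian branch is kept so a future big-endian port
    // "just works". The constant values below are illustrative only.
    internal static Vector128<byte> CreateBitmask() =>
        BitConverter.IsLittleEndian
            ? Vector128.Create((ushort)0x1001).AsByte()   // little-endian lane layout
            : Vector128.Create((ushort)0x0110).AsByte();  // big-endian lane layout
}
```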

}

FoundNonAsciiDataInCurrentMask:

Member

Same for this label, it seems like duplicate computations are being made.

Contributor Author

Hmmm, not really. Both the SSE and AdvSimd paths can come here, so we need to special case this label for both paths

{
goto FoundNonAsciiDataInCurrentMask;
currentSseMaskOrAdvSimdIndex = (uint)Sse2.MoveMask(Sse2.LoadVector128(pBuffer)); // unaligned load
Member

currentSseMaskOrAdvSimdIndex

I remember adding a comment about this variable. I think you should use different variables to represent this value for SSE2 and AdvSimd. It is a little confusing and hard to maintain with future changes.

Contributor Author

I agree, but I think this is a reasonable compromise. More info here where I replied to your question :) #39506 (comment)

Member

Ah, somehow I missed it. I have responded to your comment in the same thread.

Contributor Author

Done, fixed now.

@kunalspathak
Member

Do you mind sending a PR to dotnet/performance to add an IsAscii() benchmark? I don't see it there. Also, can you please share the code of the benchmark?

@pgovind
Contributor Author

pgovind commented Aug 10, 2020

Do you mind sending a PR to dotnet/performance to add an IsAscii() benchmark? I don't see it there. Also, can you please share the code of the benchmark?

Here's the local benchmark I was using: dotnet/performance#1445
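
For readers without access to that PR, a BenchmarkDotNet microbenchmark of this shape typically looks like the hedged sketch below; the class, parameter, and method names are illustrative, not the actual ones from dotnet/performance#1445.

```csharp
using System.Linq;
using BenchmarkDotNet.Attributes;

// Illustrative sketch only; the real benchmark lives in dotnet/performance#1445.
public class Perf_IsAscii
{
    private byte[] _bytes;

    [Params(128, 4096)]
    public int Length;

    [GlobalSetup]
    public void Setup() => _bytes = Enumerable.Repeat((byte)'a', Length).ToArray();

    [Benchmark]
    public bool IsAscii()
    {
        // Scalar check; the PR's vectorized GetIndexOfFirstNonAsciiByte is internal,
        // so a public benchmark exercises it indirectly through APIs that call it.
        foreach (byte b in _bytes)
        {
            if (b > 0x7F)
                return false;
        }
        return true;
    }
}
```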

Member

@kunalspathak kunalspathak left a comment

This looks good. Just added some minor comments.


// calculate the index
int index = BitOperations.TrailingZeroCount(mask) >> 2;
Debug.Assert((mask != 0) ? index < 16 : index >= 16);
Contributor

I found a way to find the index of the first non-ASCII byte in a Vector128<byte> value that doesn't require the bitmask.
Here is a sketch of how this can be accomplished.

           // value: Vector128<byte>
           // Vn.B[15] : [ b15 * * * * * * * ]
           // Vn.B[14] : [ b14 * * * * * * * ]

           Vector128<ushort> value2 = value.AsUInt16();
           // Vn.H[7] : [ b15 * * * * * * * b14 * * * * * * * ]
           // Vn.H[6] : [ b13 * * * * * * * b12 * * * * * * * ]

           Vector128<ushort> shiftToRightAndInsert = AdvSimd.ShiftRightAndInsert(value2, value2, 9);
           // Vn.H[7] : [ b15 * * * * * * * b14 b15 * * * * * * ]
           // Vn.H[6] : [ b13 * * * * * * * b12 b13 * * * * * * ]

           Vector64<ushort> shiftToRightAndInsert2 = AdvSimd.ExtractNarrowingLower(shiftToRightAndInsert).AsUInt16();
           // Vn.B[7] : [ b14 b15 * * * * * * ]
           // Vn.B[6] : [ b12 b13 * * * * * * ]
           
           // Vn.H[3] : [ b14 b15 * * * * * * b12 b13 * * * * * * ]
           // Vn.H[2] : [ b10 b11 * * * * * * b08 b09 * * * * * * ]

           Vector64<ushort> shiftToRightAndInsert3 = AdvSimd.ShiftRightAndInsert(shiftToRightAndInsert2, shiftToRightAndInsert2, 10);
           // Vn.H[3] : [ b14 b15 * * * * * * b12 b13 b14 b15 * * * * ]
           // Vn.H[2] : [ b10 b11 * * * * * * b08 b09 b10 b11 * * * * ]

           Vector64<ushort> shiftToRightAndInsert4 = AdvSimd.ExtractNarrowingLower(shiftToRightAndInsert3.ToVector128Unsafe()).AsUInt16();
           // Vn.B[3] : [ b12 b13 b14 b15 * * * * ]
           // Vn.B[2] : [ b08 b09 b10 b11 * * * * ]

           // Vn.H[1] : [ b12 b13 b14 b15 * * * * b08 b09 b10 b11 * * * * ]
           // Vn.H[0] : [ b04 b05 b06 b07 * * * * b00 b01 b02 b03 * * * * ]

           Vector64<uint> shiftToRightAndInsert5 = AdvSimd.ShiftRightAndInsert(shiftToRightAndInsert4, shiftToRightAndInsert4, 12).AsUInt32();
           // Vn.H[1] : [ b12 b13 b14 b15 * * * * b08 b09 b10 b11 b12 b13 b14 b15 ]
           // Vn.H[0] : [ b04 b05 b06 b07 * * * * b00 b01 b02 b03 b04 b05 b06 b07 ]

           // Vn.S[0] : [ b12 b13 b14 b15 * * * * b08 b09 b10 b11 b12 b13 b14 b15 b04 b05 b06 b07 * * * * b00 b01 b02 b03 b04 b05 b06 b07 ]

           Vector64<ushort> shiftToLeftAndInsert = AdvSimd.ShiftLeftAndInsert(shiftToRightAndInsert5, shiftToRightAndInsert5, 24).AsUInt16();
           // Vn.S[0] : [ b00 b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12 b13 b14 b15 b04 b05 b06 b07 * * * * b00 b01 b02 b03 b04 b05 b06 b07 ]

           // Vn.H[1] : [ b00 b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12 b13 b14 b15 ]

           ushort index = AdvSimd.LeadingZeroCount(shiftToLeftAndInsert).GetElement(1);

Here is a link to proof-of-concept code: https://gist.github.com/echesakovMSFT/b27ed28024091472db6d3ca007f34a6d#file-program-cs

The code that the JIT generates for the part computing the index is as follows:

        4EB01E11          mov     v17.16b, v16.16b
        6F174611          sri     v17.8h, v16.8h, #9
        0E212A30          xtn     v16.8b, v17.8h
        0EB01E11          mov     v17.8b, v16.8b
        2F164611          sri     v17.4h, v16.4h, #10
        0EB11E30          mov     v16.8b, v17.8b
        0E212A10          xtn     v16.8b, v16.8h
        0EB01E11          mov     v17.8b, v16.8b
        2F144611          sri     v17.4h, v16.4h, #12
        0EB11E30          mov     v16.8b, v17.8b
        2F385630          sli     v16.2s, v17.2s, #24
        2E604A10          clz     v16.4h, v16.4h
        0E063E00          umov    w0, v16.h[1]

The five movs are redundant and shouldn't be there - I am working on analysing why the JIT emits them (I suspect this is how we handle BuildDelayFreeUses). Theoretically, the code should contain only 3 x sri, 2 x xtn, 1 x sli, 1 x clz and 1 x umov instructions.

@TamarChristinaArm Do you see any issue with this approach? Aside from only supporting little-endian.

Contributor

@echesakov echesakov Aug 12, 2020

This code also allows computing the mask of non-ASCII bytes by replacing

ushort index = AdvSimd.LeadingZeroCount(shiftToLeftAndInsert).GetElement(1);

with

ushort mask = shiftToLeftAndInsert.GetElement(1);

Although the mask will be reversed in this case: mask.15 corresponds to lane[0].7 and mask.0 corresponds to lane[15].7.

Contributor Author

@pgovind pgovind Aug 12, 2020

I'd actually like to get this PR in soon, considering our deadline is Friday. I'm thinking I'll file a follow-up issue to investigate this improvement and the improvements suggested in #39507. How does that sound?

Contributor

How does that sound?

@pgovind Sure, to clarify, I was not suggesting incorporating the algorithm right now. I understand that this would be an unreasonable ask.

But I can explain the motivation and how I discovered this approach. I was trying to find a way to avoid doing this in your code

            ulong mask = extractedBits.AsUInt64().ToScalar();

            // calculate the index
            int index = BitOperations.TrailingZeroCount(mask) >> 2;

since a call to BitOperations.TrailingZeroCount(mask) is not free, and to use AdvSimd.LeadingZeroCount instead.

It turns out that I ended up with a completely different approach.
We can follow up on this in .NET 6.

Contributor

@TamarChristinaArm TamarChristinaArm Aug 13, 2020

@echesakovMSFT In and of itself the TrailingZeroCount isn't that bad, because the integer version of clz is twice as fast as the SIMD version of clz in your example. So the rbit+clz on the genreg side is about as efficient as the clz on the SIMD side from your example (looking at a Neoverse-N1).

The example you have, while it avoids having to load the mask, ends up with 7 instructions that are 2 cycles each, and the sri and sli go down the same pipe so they can't be executed in parallel.

So I think if you amortize the cost of the mask, that sequence will end up being slower.

That said, you can improve this sequence further. The issue here isn't the use of a mask, it's the use of a mask that can't be created in a register. Instead, take a look at the sequence I pasted in #39507 (comment), which also avoids the TrailingZeroCount (on big-endian) by using a different mask (which is also created without requiring a load). That is a sequence of 6 cycles and only blocks V1 once.

But looking at the overall algorithm, the majority of the cases don't require the actual index yet (by which I mean, you only check whether it contains one or not...). The helper function should really end at the .ToScalar(); and ContainsNonAsciiByte_AdvSimd should really just be != 0.

Also, what is the expected number of loop iterations? If you expect the majority of the characters to be ASCII, then the implementation is not really taking advantage of that.

	currentAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(firstVector, bitmask);
	secondAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(secondVector, bitmask);
	if (ContainsNonAsciiByte_AdvSimd(currentAdvSimdIndex) || ContainsNonAsciiByte_AdvSimd(secondAdvSimdIndex))

You can test both vectors together without needing a mask. Essentially what you want is a slightly modified version of strlen: https://github.com/ARM-software/optimized-routines/blob/224cb5f67b71757b99fe1e10b5a437c17a1d733c/string/aarch64/strlen.S#L144

	ldp	currentAdvSimdIndex, secondAdvSimdIndex, [src, 32]!
	sminp	maskv.16b, currentAdvSimdIndex.16b, secondAdvSimdIndex.16b
	sminp	maskv.16b, maskv.16b, maskv.16b
	fmov	synd, maskd
	tst     synd, 0x8080808080808080

Now, when the synd test succeeds you know you have an element there, and you have to find which vector the match is in.

You can easily do this by testing the low part of synd: if synd is x1 you test w1 using tst w1, 0x80808080.
If that succeeds you know it's in currentAdvSimdIndex, else in secondAdvSimdIndex.

After this you implement the index-finding part:

	sshr	v1.16b, v1.16b, 7
	bic	v1.8h, 0x0f, lsl 8
	umaxp	v1.16b, v1.16b, v1.16b
	fmov	x1, d1
	rbit	x0, x0
	clz	x0, x1

Now you still need the >> 2, but at the point you do this you're about to add to the offset:

pBuffer += currentAdvSimdIndex;

Which should allow the JIT to lower the >> 2 into the add as

add	len, len, x0, lsr 2

This avoids you needing the mask (it does need some minor tweaking for big-endian), but it should be significantly smaller than what is currently being generated and should be a couple of orders faster.
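
Expressed with C# intrinsics, the sminp/fmov containment check sketched above might look roughly like the following. This is an interpretation of the assembly, not code from the PR, and the names are placeholders.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class ContainsNonAsciiSketch
{
    // Treat the lanes as signed: any byte >= 0x80 is negative, so it survives the
    // pairwise minimum and leaves its high bit set in the reduced 64-bit value.
    static bool ContainsNonAsciiByte(Vector128<byte> first, Vector128<byte> second)
    {
        Vector128<sbyte> min = AdvSimd.Arm64.MinPairwise(first.AsSByte(), second.AsSByte()); // sminp
        min = AdvSimd.Arm64.MinPairwise(min, min);                                           // sminp
        ulong synd = min.AsUInt64().ToScalar();                                              // fmov
        return (synd & 0x8080808080808080) != 0;                                             // tst
    }
}
```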

Contributor

@TamarChristinaArm Thank you a lot for your detailed explanation! Going forward we should definitely capture your analysis here and in the other PR.

Contributor

np. Also, FYI, for maximum throughput, instead of sshr v1.16b, v1.16b, 7 you can interpret the value as a signed number and instead do

cmlt    v0.16b, v0.16b, #0

This has the same latency as sshr but isn't restricted to one pipe. That's a slightly better sequence than what I pasted in #39507 (comment) and above.

Contributor

And you do need an rbit above for little-endian; I'll update the comments. For this case an LD1 with two regs may be faster than an LDP, but that needs some benchmarking; it does make it easier to handle the endianness differences.

Contributor

np. Also, FYI, for maximum throughput, instead of sshr v1.16b, v1.16b, 7 you can interpret the value as a signed number and instead do

cmlt    v0.16b, v0.16b, #0

This has the same latency as sshr but isn't restricted to one pipe. That's a slightly better sequence than what I pasted in #39507 (comment) and above.

Actually, we haven't had a chance to finish #33972 in .NET 5 - so, in this case, using Vector64/128<T>.Zero would require zeroing a register and using the 3-reg form of cmlt - this is why I suggested using sshr in the first place.

Contributor

Actually, we haven't had a chance to finish #33972 in .NET 5 - so, in this case, using Vector64/128<T>.Zero would require zeroing a register and using the 3-reg form of cmlt - this is why I suggested using sshr in the first place.

Hmm, well, you can hoist Vector64/128<T>.Zero outside the loop so you pay for it only once, and assuming you don't run out of registers it should still be better, as sshr is restricted to V1 while cmlt can go in any pipe. So in a non-micro-benchmark the cmlt sequence would likely finish earlier.
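
As a rough illustration of that suggestion (placeholder names, not the PR's actual helpers), hoisting the zero vector out of the loop looks like:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class CmltSketch
{
    // Hypothetical loop shape: Vector128<sbyte>.Zero is materialized once, outside
    // the loop, so each iteration only pays for the 3-register cmlt itself.
    static unsafe bool ContainsNonAscii(byte* pBuffer, nuint byteCount)
    {
        Vector128<sbyte> zero = Vector128<sbyte>.Zero;
        for (nuint i = 0; i + 16 <= byteCount; i += 16)
        {
            // Lanes with the high bit set compare as negative and become all-ones.
            Vector128<sbyte> nonAscii = AdvSimd.CompareLessThan(AdvSimd.LoadVector128((sbyte*)(pBuffer + i)), zero);
            if ((nonAscii.AsUInt64().ToScalar() | nonAscii.AsUInt64().GetElement(1)) != 0)
                return true;
        }
        return false; // remaining tail bytes not handled in this sketch
    }
}
```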

@pgovind
Contributor Author

pgovind commented Aug 13, 2020

Latest perf numbers on my machine running WSL2 (I couldn't get it to run on Windows, so this is the next best thing):

No Slower results for the provided threshold = 0.1% and noise filter = 0.3ns.

| Faster                                                            | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ----------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Text.Experimental.Perf.ToUtf16_nonascii_110                |      3.90 |        664380.29 |        170197.37 |         |
| System.Text.Experimental.Perf.IsAscii_GetIndexOfFirstNonAsciiByte |      3.64 |        134643.28 |         36941.30 |         |
| System.Text.Experimental.Perf.IsNormalized_WidenAsciiToUtf16      |      2.72 |       1280532.46 |        470127.60 | bimodal |
| System.Text.Experimental.Perf.ToUtf16_nonascii_cyrillic           |      2.51 |        393356.80 |        156837.70 |         |
| System.Text.Experimental.Perf.ToUtf16_nonascii_chinese            |      1.87 |       1222065.89 |        654866.60 |         |

No file given

These numbers are high enough to leave me a little bit skeptical. The GetIndexOfFirstNonAsciiByte test case is a long string with no non-ASCII bytes though, so it's not inconceivable that we're seeing significant gains (the WidenAscii case has a bimodal distribution, so take those numbers with a pinch of salt). I'm not sure how being on WSL2 affects perf. I am, however, reasonably convinced that we are seeing some perf gains from this PR. Once we branch off on Friday and produce an official build, I can re-run this benchmark and get numbers. Because the methods being measured here call each other and affect perf, those numbers will be a much better representation of the overall perf gain.

cc @kunalspathak

@kunalspathak
Member

Thanks @pgovind. I went through your benchmark PR dotnet/performance#1445 and added some comments to eliminate any other noise. I am OK with merging it at this point. I just want to double-check with @echesakovMSFT on his thoughts on running benchmarks on WSL2.

@pgovind
Contributor Author

pgovind commented Aug 13, 2020

CI failures are unrelated. Merging. Thanks for the reviews everyone!

@pgovind pgovind merged commit 1de41bb into dotnet:master Aug 13, 2020
@kunalspathak
Member

If you don't mind, can you open a follow-up issue to optimize the code as per Tamar's suggestions?

@pgovind
Contributor Author

pgovind commented Aug 13, 2020

Yup, filed #40805

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020