
Get index of first non ascii byte #39506

Merged · 14 commits · Aug 13, 2020

Conversation

pgovind
Contributor

@pgovind pgovind commented Jul 17, 2020

Can be reviewed now. I'll add perf numbers in a bit.

Implements GetIndexOfFirstNonAsciiByte from #35034

@jkotas jkotas added the NO-REVIEW Experimental/testing PR, do NOT review it label Jul 17, 2020
@pgovind pgovind force-pushed the GetIndexOfFirstNonAsciiByte branch from 458145e to 613caec on July 29, 2020 21:39
@pgovind pgovind added area-System.Runtime.Intrinsics and removed NO-REVIEW Experimental/testing PR, do NOT review it labels Jul 29, 2020
@ghost

ghost commented Jul 29, 2020

Tagging subscribers to this area: @tannergooding
See info in area-owners.md if you want to be subscribed.

Member

@GrabYourPitchforks GrabYourPitchforks left a comment

Just two nits. Don't let it block you.

@pgovind
Contributor Author

pgovind commented Jul 30, 2020

Updated Perf Results:


total diff: 1

| Faster                                 | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Text.Experimental.Perf.IsAscii  |      1.06 |         14943.78 |         14523.75 |         |

No file given

Member

@kunalspathak kunalspathak left a comment

You are using the incorrect GetNonAsciiBytes(). For your purpose, you need GetIndexOfFirstNonAsciiByte().

Member

@GrabYourPitchforks GrabYourPitchforks left a comment

(undoing previous approval since many recent changes)

@GrabYourPitchforks
Member

GitHub really doesn't make it easy to remove that green check mark. :(

@pgovind pgovind force-pushed the GetIndexOfFirstNonAsciiByte branch from 24eacf8 to 5d33b49 on July 31, 2020 01:10
Prashanth Govindarajan added 2 commits July 30, 2020 19:35
Member

@lewing lewing left a comment

Visibility seems wrong for something that is consumed from private methods?

@pgovind pgovind force-pushed the GetIndexOfFirstNonAsciiByte branch from a2ea59c to 8f259e6 on July 31, 2020 22:10
@pgovind pgovind force-pushed the GetIndexOfFirstNonAsciiByte branch from 9cbe471 to 0152f82 on August 6, 2020 23:19
Debug.Assert(Sse2.IsSupported || AdvSimd.Arm64.IsSupported, "Sse2 or AdvSimd64 required.");
Debug.Assert(BitConverter.IsLittleEndian, "This SSE2/Arm64 implementation assumes little-endian.");

Vector128<byte> bitmask = BitConverter.IsLittleEndian ?
Member

I'm assuming the bitmask is intentionally defined as a local?

Member

You throw PlatformNotSupportedException() inside GetIndexOfFirstNonAsciiByteInLane() if !BitConverter.IsLittleEndian. Why are you creating the bitMask for that case to begin with?

Contributor Author

@pgovind pgovind Aug 10, 2020

I'm assuming the bitmask is intentionally defined as a local?

Yup. Declaring it static won't work because of static-initialization issues in either .NET Standard or .NET Framework (I don't remember which; I just remember @carlossanlop's conclusion).

Why are you creating the bitMask for that case to begin with?

Yup, the idea is that this part will just work if we start supporting BigEndian environments
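
For context, the pattern being discussed looks roughly like the sketch below. The constants are illustrative placeholders, not the exact values used in ASCIIUtility; only the local-vs-static and little-endian/big-endian selection pattern matters here.

```csharp
using System;
using System.Runtime.Intrinsics;

static class BitmaskSketch
{
    // Hypothetical helper mirroring the pattern under discussion: the bitmask is a
    // local (not a static field) to avoid static-initialization issues on older
    // targets, and the big-endian branch is kept so a future big-endian port
    // "just works". The constant values below are illustrative only.
    internal static Vector128<byte> CreateBitmask() =>
        BitConverter.IsLittleEndian
            ? Vector128.Create((ushort)0x1001).AsByte()   // little-endian lane layout
            : Vector128.Create((ushort)0x0110).AsByte();  // big-endian lane layout
}
```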

}

FoundNonAsciiDataInCurrentMask:

Member

Same for this label, it seems like duplicate computations are being made.

Contributor Author

Hmmm, not really. Both the SSE and AdvSimd paths can come here, so we need to special case this label for both paths

{
goto FoundNonAsciiDataInCurrentMask;
currentSseMaskOrAdvSimdIndex = (uint)Sse2.MoveMask(Sse2.LoadVector128(pBuffer)); // unaligned load
Member

currentSseMaskOrAdvSimdIndex

I remember adding a comment about this variable. I think you should use different variables to represent this value for SSE2 and AdvSimd. It is a little confusing and hard to maintain with future changes.

Contributor Author

I agree, but I think this is a reasonable compromise. More info here where I replied to your question :) #39506 (comment)

Member

Ah, somehow I missed it. I have responded to your comment in the same thread.

Contributor Author

Done, fixed now.

@kunalspathak
Member

Do you mind sending a PR to dotnet/performance to add an IsAscii() benchmark? I don't see it there. Also, can you please share the code of the benchmark?

@pgovind
Contributor Author

pgovind commented Aug 10, 2020

Do you mind sending a PR to dotnet/performance to add an IsAscii() benchmark? I don't see it there. Also, can you please share the code of the benchmark?

Here's the local benchmark I was using: dotnet/performance#1445
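
For readers without access to that PR, a BenchmarkDotNet microbenchmark of this shape typically looks like the hedged sketch below; the class, parameter, and method names are illustrative, not the actual ones from dotnet/performance#1445.

```csharp
using System.Linq;
using BenchmarkDotNet.Attributes;

// Illustrative sketch only; the real benchmark lives in dotnet/performance#1445.
public class Perf_IsAscii
{
    private byte[] _bytes;

    [Params(128, 4096)]
    public int Length;

    [GlobalSetup]
    public void Setup() => _bytes = Enumerable.Repeat((byte)'a', Length).ToArray();

    [Benchmark]
    public bool IsAscii()
    {
        // Scalar check; the PR's vectorized GetIndexOfFirstNonAsciiByte is internal,
        // so a public benchmark exercises it indirectly through APIs that call it.
        foreach (byte b in _bytes)
        {
            if (b > 0x7F)
                return false;
        }
        return true;
    }
}
```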

Member

@kunalspathak kunalspathak left a comment

This looks good. Just added some minor comments.


// calculate the index
int index = BitOperations.TrailingZeroCount(mask) >> 2;
Debug.Assert((mask != 0) ? index < 16 : index >= 16);
Contributor

I found a way to find the index of the first non-ASCII byte in a Vector128<byte> value that doesn't require the bitmask.
Here is a sketch of how this can be accomplished.

           // value: Vector128<byte>
           // Vn.B[15] : [ b15 * * * * * * * ]
           // Vn.B[14] : [ b14 * * * * * * * ]

           Vector128<ushort> value2 = value.AsUInt16();
           // Vn.H[7] : [ b15 * * * * * * * b14 * * * * * * * ]
           // Vn.H[6] : [ b13 * * * * * * * b12 * * * * * * * ]

           Vector128<ushort> shiftToRightAndInsert = AdvSimd.ShiftRightAndInsert(value2, value2, 9);
           // Vn.H[7] : [ b15 * * * * * * * b14 b15 * * * * * * ]
           // Vn.H[6] : [ b13 * * * * * * * b12 b13 * * * * * * ]

           Vector64<ushort> shiftToRightAndInsert2 = AdvSimd.ExtractNarrowingLower(shiftToRightAndInsert).AsUInt16();
           // Vn.B[7] : [ b14 b15 * * * * * * ]
           // Vn.B[6] : [ b12 b13 * * * * * * ]
           
           // Vn.H[3] : [ b14 b15 * * * * * * b12 b13 * * * * * * ]
           // Vn.H[2] : [ b10 b11 * * * * * * b08 b09 * * * * * * ]

           Vector64<ushort> shiftToRightAndInsert3 = AdvSimd.ShiftRightAndInsert(shiftToRightAndInsert2, shiftToRightAndInsert2, 10);
           // Vn.H[3] : [ b14 b15 * * * * * * b12 b13 b14 b15 * * * * ]
           // Vn.H[2] : [ b10 b11 * * * * * * b08 b09 b10 b11 * * * * ]

           Vector64<ushort> shiftToRightAndInsert4 = AdvSimd.ExtractNarrowingLower(shiftToRightAndInsert3.ToVector128Unsafe()).AsUInt16();
           // Vn.B[3] : [ b12 b13 b14 b15 * * * * ]
           // Vn.B[2] : [ b08 b09 b10 b11 * * * * ]

           // Vn.H[1] : [ b12 b13 b14 b15 * * * * b08 b09 b10 b11 * * * * ]
           // Vn.H[0] : [ b04 b05 b06 b07 * * * * b00 b01 b02 b03 * * * * ]

           Vector64<uint> shiftToRightAndInsert5 = AdvSimd.ShiftRightAndInsert(shiftToRightAndInsert4, shiftToRightAndInsert4, 12).AsUInt32();
           // Vn.H[1] : [ b12 b13 b14 b15 * * * * b08 b09 b10 b11 b12 b13 b14 b15 ]
           // Vn.H[0] : [ b04 b05 b06 b07 * * * * b00 b01 b02 b03 b04 b05 b06 b07 ]

           // Vn.S[0] : [ b12 b13 b14 b15 * * * * b08 b09 b10 b11 b12 b13 b14 b15 b04 b05 b06 b07 * * * * b00 b01 b02 b03 b04 b05 b06 b07 ]

           Vector64<ushort> shiftToLeftAndInsert = AdvSimd.ShiftLeftAndInsert(shiftToRightAndInsert5, shiftToRightAndInsert5, 24).AsUInt16();
           // Vn.S[0] : [ b00 b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12 b13 b14 b15 b04 b05 b06 b07 * * * * b00 b01 b02 b03 b04 b05 b06 b07 ]

           // Vn.H[1] : [ b00 b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12 b13 b14 b15 ]

           ushort index = AdvSimd.LeadingZeroCount(shiftToLeftAndInsert).GetElement(1);

Here is a link to proof-of-concept code: https://gist.github.com/echesakovMSFT/b27ed28024091472db6d3ca007f34a6d#file-program-cs

The code that the JIT generates for the part computing the index is as follows:

        4EB01E11          mov     v17.16b, v16.16b
        6F174611          sri     v17.8h, v16.8h, #9
        0E212A30          xtn     v16.8b, v17.8h
        0EB01E11          mov     v17.8b, v16.8b
        2F164611          sri     v17.4h, v16.4h, #10
        0EB11E30          mov     v16.8b, v17.8b
        0E212A10          xtn     v16.8b, v16.8h
        0EB01E11          mov     v17.8b, v16.8b
        2F144611          sri     v17.4h, v16.4h, #12
        0EB11E30          mov     v16.8b, v17.8b
        2F385630          sli     v16.2s, v17.2s, #24
        2E604A10          clz     v16.4h, v16.4h
        0E063E00          umov    w0, v16.h[1]

The five movs are redundant and shouldn't be there - I am working on analysing why the JIT emits them (I suspect this is how we handle BuildDelayFreeUses). Theoretically, the code should contain only 3 x sri, 2 x xtn, 1 x sli, 1 x clz and 1 x umov instructions.

@TamarChristinaArm Do you see any issue with this approach? Aside from only supporting little-endian.

Contributor

@echesakov echesakov Aug 12, 2020

This code also allows computing the mask of non-ASCII bytes by replacing

ushort index = AdvSimd.LeadingZeroCount(shiftToLeftAndInsert).GetElement(1);

with

ushort mask = shiftToLeftAndInsert.GetElement(1);

Although the mask will be reversed in this case: mask.15 corresponds to lane[0].7 and mask.0 corresponds to lane[15].7.

Contributor Author

@pgovind pgovind Aug 12, 2020

I'd actually like to get this PR in soon, considering our deadline is Friday. I'm thinking I'll file a follow-up issue to investigate this improvement and the improvements suggested in #39507. How does that sound?

Contributor

How does that sound?

@pgovind Sure, to clarify, I was not suggesting incorporating the algorithm right now. I understand that this would be an unreasonable ask.

But I can explain the motivation and how I discovered this approach. I was trying to find a way to avoid doing this in your code

            ulong mask = extractedBits.AsUInt64().ToScalar();

            // calculate the index
            int index = BitOperations.TrailingZeroCount(mask) >> 2;

since a call to BitOperations.TrailingZeroCount(mask) is not free, and to use AdvSimd.LeadingZeroCount instead.

It turns out that I ended up with a completely different approach.
We can follow up on this in .NET 6.

Contributor

@TamarChristinaArm TamarChristinaArm Aug 13, 2020

@echesakovMSFT In and of itself the TrailingZeroCount isn't that bad, because the integer version of clz is twice as fast as the SIMD version of clz in your example. So the rbit+clz on the genreg side is about as efficient as the clz on the SIMD side from your example (looking at a Neoverse-N1).

The example you have, while it avoids having to load the mask, ends up with 7 instructions that are 2 cycles each, and the sri and sli go down the same pipe so they can't be executed in parallel.

So I think if you amortize the cost of the mask, that sequence will end up being slower.

That said, you can improve this sequence further. The issue here isn't the use of a mask, it's the use of a mask that can't be created in a register. Instead, take a look at the sequence I pasted in #39507 (comment), which also avoids the TrailingZeroCount (on big-endian) by using a different mask (which is also created without requiring a load). That is a sequence of 6 cycles and only blocks V1 once.

But looking at the overall algorithm, the majority of the cases don't require the actual index yet (by which I mean, you only check whether it contains one or not...). The helper function should really end at the .ToScalar(); and ContainsNonAsciiByte_AdvSimd should really just be != 0.

Also, what is the expected number of loop iterations? If you expect the majority of the characters to be ASCII, then the implementation is not really taking advantage of that.

	currentAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(firstVector, bitmask);
	secondAdvSimdIndex = (uint)GetIndexOfFirstNonAsciiByteInLane_AdvSimd(secondVector, bitmask);
	if (ContainsNonAsciiByte_AdvSimd(currentAdvSimdIndex) || ContainsNonAsciiByte_AdvSimd(secondAdvSimdIndex))

You can test both vectors together without needing a mask. Essentially what you want is a slightly modified version of strlen: https://github.com/ARM-software/optimized-routines/blob/224cb5f67b71757b99fe1e10b5a437c17a1d733c/string/aarch64/strlen.S#L144

	ldp	currentAdvSimdIndex, secondAdvSimdIndex, [src, 32]!
	sminp	maskv.16b, currentAdvSimdIndex.16b, secondAdvSimdIndex.16b
	sminp	maskv.16b, maskv.16b, maskv.16b
	fmov	synd, maskd
	tst     synd, 0x8080808080808080

Now, when the synd test succeeds you know you have an element there, and you have to find which vector the match is in.

You can easily do this by testing the low part of synd: if synd is x1 you test w1 using tst w1, 0x80808080.
If that succeeds you know it's in currentAdvSimdIndex, else in secondAdvSimdIndex.

After this you implement the index-finding part:

	sshr	v1.16b, v1.16b, 7
	bic	v1.8h, 0x0f, lsl 8
	umaxp	v1.16b, v1.16b, v1.16b
	fmov	x1, d1
	rbit	x0, x0
	clz	x0, x1

Now you still need the >> 2, but at the point you do this you're about to add to the offset:

pBuffer += currentAdvSimdIndex;

Which should allow the JIT to lower the >> 2 into the add as

add	len, len, x0, lsr 2

This avoids you needing the mask (it does need some minor tweaking for big-endian), but it should be significantly smaller than what is currently being generated and should be a couple of orders faster.
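
Expressed with C# intrinsics, the sminp/fmov containment check sketched above might look roughly like the following. This is an interpretation of the assembly, not code from the PR, and the names are placeholders.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class ContainsNonAsciiSketch
{
    // Treat the lanes as signed: any byte >= 0x80 is negative, so it survives the
    // pairwise minimum and leaves its high bit set in the reduced 64-bit value.
    static bool ContainsNonAsciiByte(Vector128<byte> first, Vector128<byte> second)
    {
        Vector128<sbyte> min = AdvSimd.Arm64.MinPairwise(first.AsSByte(), second.AsSByte()); // sminp
        min = AdvSimd.Arm64.MinPairwise(min, min);                                           // sminp
        ulong synd = min.AsUInt64().ToScalar();                                              // fmov
        return (synd & 0x8080808080808080) != 0;                                             // tst
    }
}
```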

Contributor

@TamarChristinaArm Thank you a lot for your detailed explanation! Going forward we should definitely capture your analysis here and in the other PR.

Contributor

np. Also, FYI, for maximum throughput, instead of sshr v1.16b, v1.16b, 7 you can interpret the value as a signed number and instead do

cmlt    v0.16b, v0.16b, #0

This has the same latency as sshr but isn't restricted to one pipe. That's a slightly better sequence than what I pasted in #39507 (comment) and above.

Contributor

And you do need an rbit above for little-endian; I'll update the comments. For this case an LD1 with two regs may be faster than an LDP, but that needs some benchmarking; it does make it easier to handle the endianness differences.

Contributor

np. Also, FYI, for maximum throughput, instead of sshr v1.16b, v1.16b, 7 you can interpret the value as a signed number and instead do

cmlt    v0.16b, v0.16b, #0

This has the same latency as sshr but isn't restricted to one pipe. That's a slightly better sequence than what I pasted in #39507 (comment) and above.

Actually, we haven't had a chance to finish #33972 in .NET 5 - so, in this case, using Vector64/128<T>.Zero would require zeroing a register and using the 3-reg form of cmlt - this is why I suggested using sshr in the first place.

Contributor

Actually, we haven't had a chance to finish #33972 in .NET 5 - so, in this case, using Vector64/128<T>.Zero would require zeroing a register and using the 3-reg form of cmlt - this is why I suggested using sshr in the first place.

Hmm, well, you can hoist Vector64/128<T>.Zero outside the loop so you pay for it only once, and assuming you don't run out of registers it should still be better, as sshr is restricted to V1 while cmlt can go in any pipe. So in a non-micro-benchmark the cmlt sequence would likely finish earlier.
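
As a rough illustration of that suggestion (placeholder names, not the PR's actual helpers), hoisting the zero vector out of the loop looks like:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class CmltSketch
{
    // Hypothetical loop shape: Vector128<sbyte>.Zero is materialized once, outside
    // the loop, so each iteration only pays for the 3-register cmlt itself.
    static unsafe bool ContainsNonAscii(byte* pBuffer, nuint byteCount)
    {
        Vector128<sbyte> zero = Vector128<sbyte>.Zero;
        for (nuint i = 0; i + 16 <= byteCount; i += 16)
        {
            // Lanes with the high bit set compare as negative and become all-ones.
            Vector128<sbyte> nonAscii = AdvSimd.CompareLessThan(AdvSimd.LoadVector128((sbyte*)(pBuffer + i)), zero);
            if ((nonAscii.AsUInt64().ToScalar() | nonAscii.AsUInt64().GetElement(1)) != 0)
                return true;
        }
        return false; // remaining tail bytes not handled in this sketch
    }
}
```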

@pgovind
Contributor Author

pgovind commented Aug 13, 2020

Latest perf numbers on my machine running WSL2 (I couldn't get it to run on Windows, so this is the next best thing):

No Slower results for the provided threshold = 0.1% and noise filter = 0.3ns.

| Faster                                                            | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ----------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Text.Experimental.Perf.ToUtf16_nonascii_110                |      3.90 |        664380.29 |        170197.37 |         |
| System.Text.Experimental.Perf.IsAscii_GetIndexOfFirstNonAsciiByte |      3.64 |        134643.28 |         36941.30 |         |
| System.Text.Experimental.Perf.IsNormalized_WidenAsciiToUtf16      |      2.72 |       1280532.46 |        470127.60 | bimodal |
| System.Text.Experimental.Perf.ToUtf16_nonascii_cyrillic           |      2.51 |        393356.80 |        156837.70 |         |
| System.Text.Experimental.Perf.ToUtf16_nonascii_chinese            |      1.87 |       1222065.89 |        654866.60 |         |

No file given

These numbers are high enough to leave me a little bit skeptical. The GetIndexOfFirstNonAsciiByte test case is a long string with no non-ASCII bytes though, so it's not inconceivable that we're seeing significant gains (the WidenAscii case has a bimodal distribution, so take those numbers with a pinch of salt). I'm not sure how being on WSL2 affects perf. I am, however, reasonably convinced that we are seeing some perf gains from this PR. Once we branch off on Friday and produce an official build, I can re-run this benchmark and get numbers. Because the methods being measured here call each other and affect perf, those numbers will be a much better representation of the overall perf gain.

cc @kunalspathak

@kunalspathak
Member

Thanks @pgovind. I went through your benchmark PR dotnet/performance#1445 and added some comments to eliminate any other noise. I am OK with merging it at this point. I just want to double-check with @echesakovMSFT on his thoughts on running benchmarks on WSL2.

@pgovind
Contributor Author

pgovind commented Aug 13, 2020

CI failures are unrelated. Merging. Thanks for the reviews everyone!

@pgovind pgovind merged commit 1de41bb into dotnet:master Aug 13, 2020
@kunalspathak
Member

If you don't mind, can you open a follow-up issue to optimize the code as per Tamar's suggestions?

@pgovind
Contributor Author

pgovind commented Aug 13, 2020

Yup, filed #40805

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020