Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get index of first non ascii byte #39506
Get index of first non ascii byte #39506
Changes from 10 commits
5d33b49
9d4182c
d22d29c
3dca000
49d87f0
6d77419
8f259e6
9dd727c
b8014d9
0152f82
0370ec9
c3008eb
6c12885
576af36
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a way how to find an index of a first non-ascii byte in
Vector128<byte> value
that doesn't requirebitmask
.Here is a sketch how this can be accomplished.
Here is a link to proof-of-concept code: https://gist.github.com/echesakovMSFT/b27ed28024091472db6d3ca007f34a6d#file-program-cs
There code that JIT generates for the part computing the index as follows:
The five
mov
s are redundant and shouldn't be there - I am working on analysing why JIT emits them (I suspect this is how we handleBuildDelayFreeUses
). Theoretically, the code should contain only 3 xsri
, 2 xxtn
, 1 xsli
, 1 xclz
and 1 xumov
instructions.@TamarChristinaArm Do you see any issue with this approach? Aside from only supporting little-endian.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code also allows to compute mask non-ascii bytes by replacing
with
Although, the mask will be reversed in this case
mask.15
correspond tolane[0].7
andmask.0
corresponds tolane[15].7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd actually like to get this PR in soon, considering our deadline is Friday. I'm thinking I'll file a follow-up issue to investigate this improvement and the improvements suggested in #39507. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pgovind Sure, to clarify I was not suggesting to incorporate the algorithm right now. I understand that this would be an unreasonable ask.
But I can explain motivation and how I discovered this approach. I was trying to find a way to avoid doing this in your code
since a call to
BitOperations.TrailingZeroCount(mask)
is not free and useAdvSimd.LeadingZeroCount
instead.It turns out that I ended up with a completely different approach.
We can follow up on this in .NET 6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@echesakovMSFT In and out of itself the
TrailingZeroCount
isn't that bad because the integer version ofclz
is twice as fast as the simd version ofclz
in your example. so therbit+clz
on the genreg side is about as efficient as theclz
on the SIMD side from your example (looking at a Neoverse-N1).The example you have, while it avoids having to load the mask ends up with 7 instructions that are 2 cycles each and the
sri
andsli
go down the same pipe so they can't be executed in parallel.so I think if you amortize the cost of the mask that sequence will end up being slower.
That said, you can improve this sequence further. The issue here isn't the use of a mask, it's the use of a mask that can't be created in register. Instead take a look at the sequence I pasted #39507 (comment) which also avoid the
TrailingZeroCount
(on big endian) by using a different mask (which is also created without requiring a load). That is a sequence of 6 cycles and only blocksV1
once.But looking at the overall algorithm, the majority of the cases don't require the actual index yet (with which I mean, you only check if it contains it or not...). The helper function should really end at the
.ToScalar();
andContainsNonAsciiByte_AdvSimd
should really just be!= 0
.Also what is the expected loop iterations? if you expect the majority of the characters to be ascii then the implementation is not really taking advantage of that.
You can test both vector together without needing a mask. Essentially what you want is a slightly modified version of strlen https://github.com/ARM-software/optimized-routines/blob/224cb5f67b71757b99fe1e10b5a437c17a1d733c/string/aarch64/strlen.S#L144
now when
synd
test succeeds so you know you have an element there, you have to find in which vector the match is.You can easily do this by testing the lowpart of
synd
, so ifsynd
isx1
you testw1
usingtst w1, 0x80808080
if that succeeds you know it's in the
currentAdvSimdIndex
else insecondAdvSimdIndex
.after this you implement the index finding part
Now you still need the
>> 2
but at the point you do this you're about to add to the offset:Which should allow the JIT to lower the
>> 2
into theadd
asThis avoid you needing the mask (does need some minor tweaking for big-endian) but should be significantly smaller than what is currently being generated and should be a couple of orders faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TamarChristinaArm Thank you a lot for your detailed expanation! Going forward we should definitely capture your analysis here and in other PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np, Also FYI for maximum throughput instead of
sshr v1.16b, v1.16b, 7
you can use interpret it as a signed number and instead doWhich has the same latency as
sshr
but isn't restricted to one pipe. That's a slightly better sequence than what I pasted in #39507 (comment) and above.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and you do need an
rbit
above for little endian, I'll update the comments. for this case anLD1
with two regs may be faste than anLDP
but that needs some benchmarking, it does make it easier to handle the endian differencesThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we haven't had a chance to finish #33972 in .NET 5 - so, in this case, using
Vector64/128<T>.Zero
would require to zero a register and use 3-reg form ofcmlt
- this is why I suggested to usesshr
at the first place.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm well you can declare the use of
Vector64/128<T>.Zero
to outside the loop so you pay for it only once, and assuming you don't run out of registers it should still be better assshr
is restricted toV1
whilecmlt
can go in any pipe. So in a non-micro benchmark thecmlt
sequence would likely finish earlier.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming the bitmask is intentionally defined as a local?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You throw
PlatformNotSupportedException()
insideGetIndexOfFirstNonAsciiByteInLane()
if!BitConverter.IsLittleEndian
. Why are you creating thebitMask
for that case to begin with?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Declaring it static won't worry because of initializing issues in either .NET Standard or .NET Framework (I don't remember which. I just remember @carlossanlop's conclusion)
Yup, the idea is that this part will just work if we start supporting BigEndian environments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember adding a comment about this variable. I think you should use different variables to represent this value for
SSE2
andAdvSimd
. It is little confusing and hard to maintain with future changes.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but I think this is a reasonable compromise. More info here where I replied to your question :) #39506 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, somehow I missed it. I have responded to your comment in the same thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for this label, it seems like duplicate computations are being made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, not really. Both the SSE and AdvSimd paths can come here, so we need to special case this label for both paths