AdvSimd support for System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar #39050

Merged
merged 9 commits into from
Jul 16, 2020

Member

@carlossanlop carlossanlop commented Jul 10, 2020

Contributes to #35035

Adds AdvSimd support for
System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar()
inside the file
runtime\src\libraries\System.Private.CoreLib\src\System\Text\Unicode\Utf16Utility.Validation.cs

I've been having difficulties testing this on my ARM device, so I want to analyze the CI results.
I manually executed an additional "Libraries Test Run" pipeline to ensure arm64 runs on all platforms.

@ghost

ghost commented Jul 10, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

Improve Arm64MoveMask.
@carlossanlop carlossanlop marked this pull request as ready for review July 11, 2020 00:33
@carlossanlop
Member Author

carlossanlop commented Jul 20, 2020

@kunalspathak The System.Text.Json string performance tests seem to be touching the code from my 3 PRs.
As suggested above, here is a comparison of SSE2 (off and on) and a comparison of ARM64 (before and after my changes).

x64 PC - SSE2 turned off vs SSE2 turned on
❯ dotnet run --base D:\results\sse2_off_json\ --diff D:\results\sse2_on_json --threshold 0.01%
summary:
better: 20, geomean: 1.185
worse: 4, geomean: 1.054
total diff: 24

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.06 | 51898200.00 | 55086725.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.06 | 51997612.50 | 55014975.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.05 | 65135312.50 | 68491050.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.05 | 64600650.00 | 67619275.00 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.43 | 11293877.27 | 7916990.63 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.38 | 11165018.18 | 8102564.52 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.37 | 9823526.00 | 7185817.14 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.35 | 5694394.32 | 4206723.73 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.35 | 5721878.41 | 4242511.86 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.32 | 6056718.75 | 4584055.56 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.30 | 15584070.00 | 11958591.67 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.28 | 5994221.43 | 4669331.13 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.22 | 15702625.00 | 12883245.45 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.19 | 9653904.00 | 8126522.58 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.12 | 9575730.77 | 8580234.48 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.11 | 9512165.38 | 8569112.07 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.11 | 12839168.42 | 11591104.76 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.09 | 71462525.00 | 65468425.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.08 | 12907131.58 | 11975097.62 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.05 | 9149053.70 | 8703067.24 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.04 | 9009832.14 | 8694486.21 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.04 | 53630000.00 | 51765800.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.03 | 69311950.00 | 67138525.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.00 | 51853775.00 | 51810525.00 | |

ARM64 laptop - Before and after Utf16Utility.GetPointerToFirstInvalidChar changes

This PR.

root@calopearm:/home/calope/performance/src/tools/ResultsComparer# /home/calope/dotnet/dotnet run --base /home/calope/calope_master/ --diff /home/calope/calope_u16/ --threshold 0.01%
summary:
better: 3, geomean: 1.121
worse: 2, geomean: 1.200
total diff: 5

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.30 | 23424506.25 | 30519845.45 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.10 | 112796200.00 | 124601150.00 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.18 | 22947991.67 | 19428000.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.15 | 135274850.00 | 118045100.00 | several?|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.04 | 29242287.50 | 28128337.50 | |

ARM64 laptop - Before and after Utf8Utility.TranscodeToUtf8 changes
root@calopearm:/home/calope/performance/src/tools/ResultsComparer# /home/calope/dotnet/dotnet run --base /home/calope/calope_master/ --diff /home/calope/calope_u8t/ --threshold 0.01%
summary:
better: 4, geomean: 1.179
worse: 2, geomean: 1.077
total diff: 6

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.08 | 98334600.00 | 106144900.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.07 | 95537100.00 | 102646150.00 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.24 | 135274850.00 | 109235700.00 | several?|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.22 | 16219827.78 | 13345030.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.18 | 23424506.25 | 19789441.67 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.08 | 21448275.00 | 19773771.43 | bimodal|

ARM64 laptop - Before and after Utf8Utility.GetPointerToFirstInvalidByte changes
root@calopearm:/home/calope/performance/src/tools/ResultsComparer# /home/calope/dotnet/dotnet run --base /home/calope/calope_master/ --diff /home/calope/calope_u8v/ --threshold 0.01%
summary:
better: 3, geomean: 1.133
worse: 5, geomean: 1.184
total diff: 8

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.29 | 98334600.00 | 126692800.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.27 | 19967450.00 | 25323100.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.22 | 24453810.00 | 29717807.69 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.11 | 12293871.43 | 13701875.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.05 | 112796200.00 | 118735400.00 | several?|

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.21 | 27319850.00 | 22669837.50 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: True, SkipValida | 1.10 | 105206900.00 | 95269000.00 | bimodal|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.09 | 135274850.00 | 123671950.00 | several?|

ARM64 laptop - All 3 PRs merged together

This result combines the changes from all 3 PRs and shows the final performance results.

root@calopearm:/home/calope/performance/src/tools/ResultsComparer# /home/calope/dotnet/dotnet run --base /home/calope/calope_master/ --diff /home/calope/calope_all/ --threshold 0.01%
summary:
better: 2, geomean: 1.277
worse: 2, geomean: 1.196
total diff: 4

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.24 | 12293871.43 | 15192383.33 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf8(Formatted: False, SkipValid | 1.16 | 12584672.22 | 14562296.67 | bimodal|

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| --- | ---:| ---:| ---:| ---:|
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: True, SkipValid | 1.32 | 27319850.00 | 20695350.00 | |
| System.Text.Json.Tests.Perf_Strings.WriteStringsUtf16(Formatted: False, SkipVali | 1.23 | 135274850.00 | 109570450.00 | several?|

@TamarChristinaArm
Contributor

> @kunalspathak The System.Text.Json string performance tests seem to be touching the code from my 3 PRs.
> As suggested above, here is a comparison of SSE2 (off and on) and a comparison of ARM64 (before and after my changes).
>
> x64 PC - SSE2 turned off vs SSE2 turned on
> ARM64 laptop - Before and after Utf16Utility.GetPointerToFirstInvalidChar changes
> ARM64 laptop - Before and after Utf8Utility.TranscodeToUtf8 changes
> ARM64 laptop - Before and after Utf8Utility.GetPointerToFirstInvalidByte changes
> ARM64 laptop - All 3 PRs merged together

Hmm that's peculiar... Do you happen to have the codegen for the methods? :)

@kunalspathak
Member

Just for reference : The performance benchmarks code is here.

Questions/Observations:

  1. I see a lot more benchmarks run on SSE2 than on AdvSimd. I can't see their full names. Have we missed reporting any benchmarks for AdvSimd above?

  2. It looks like SSE2 is better or mostly flat (with the exception of 1.06 etc., which I consider noise) if we compare on vs. off. So essentially SSE2 improved the performance of all the methods touched in your 3 PRs, or at least didn't regress any of those 3 methods.

  3. There is a set of benchmarks that appear slower than their SSE2 counterparts. E.g. for GetPointerToFirstInvalidByte, there are 5 slower benchmarks, the worst being 1.29x, i.e. roughly 30% slower, which is huge.

  4. It would be good to list all the benchmarks for AdvSimd, as shown for SSE2 (along with full names so we know which scenario is slower), in your "All 3 PRs merged together" section above. From what you have shared, there are a couple of benchmarks for which we see a 15%~24% regression.

The first line of action is to see if you can repro the regression on your local machine (just take the relevant code from the benchmark and experiment to see if you observe the slowness). The next thing to do is compare the JIT code before and after your changes, as @TamarChristinaArm pointed out. It might not be something that you introduced; possibly those methods already run decently with the existing vectorization, and the operations we are doing to find the mask are turning out more expensive than their counterparts in the default implementation.

@kunalspathak
Member

I verified the code of TranscodeToUtf8() and it looks like the intrinsic code path should be faster. Among other things, the default implementation advances the input buffer pointer by 4 positions while the intrinsic code path advances it by 8 positions, so the intrinsic code path should clearly be faster. Another possibility is that the optimized methods contribute little to the benchmarks compared to the other methods. In any case, to reduce distraction, it would be good to isolate benchmark code (perhaps write your own) that tests only the methods you have touched, and take measurements that way.

@carlossanlop
Member Author

carlossanlop commented Jul 20, 2020

> I see a lot more benchmarks run on SSE2 than on AdvSimd. I can't see their full names. Have we missed reporting any benchmarks for AdvSimd above?

@kunalspathak the tool that compares the benchmark results before and after only shows in the output table the results whose performance difference is above the specified threshold. In the results I shared, this is true for the Sse2 comparison, but not for the AdvSimd results (their diff didn't reach the threshold, so they are not shown).

tannergooding pushed a commit to tannergooding/runtime that referenced this pull request Jul 21, 2020
…tInvalidChar (dotnet#39050)

* AdvSimd support for System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar

* Move using directive outside #if.
Improve Arm64MoveMask.

* Change overloads

* UIn64 in Arm64MoveMask

* Build error implicit conversion fix

* Rename method and use simpler version

* Use ShiftRightArithmetic instead of CompareEqual + And.

* Remove unnecessary comment

* Add missing shims causing Linux build to fail
@carlossanlop carlossanlop deleted the ARM-Utf16Utility.Validation branch July 22, 2020 01:21
@carlossanlop
Member Author

I created 3 benchmark tests that touch the code from each of my 3 PRs. You can see the code in this commit in my fork.

I ran these tests in my x64 PC first, with SSE2 disabled then enabled:

❯ dotnet run --base C:\Users\calope\Desktop\sse2_off --diff C:\Users\calope\Desktop\sse2_on --threshold 0.00001%
summary:
better: 2, geomean: 1.080
total diff: 2

No Slower results for the provided threshold = 0.00001% and noise filter = 0.3ns.

| Faster                                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Text.Experimental.Perf_Intrinsics.Utf8Utility_GetPointerToFirstInvalidByt |      1.16 |            80.35 |            69.53 |         |
| System.Text.Experimental.Perf_Intrinsics.Utf16Utility_GetPointerToFirstInvalidCh |      1.01 |            73.23 |            72.50 |         |

Then I ran them in my ARM64 WSL2 VM, with AdvSimd disabled then enabled:

calope@calopearm:~/performance/src/tools/ResultsComparer$ dotnet run --base /home/calope/arm64_off/ --diff /home/calope/arm64_on/ --threshold 0.0001%
summary:
worse: 1, geomean: 1.176
total diff: 1

| Slower                                                                           | diff/base | Base Median (ns) | Diff Median (ns) | Modality|
| -------------------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Text.Experimental.Perf_Intrinsics.Utf16Utility_GetPointerToFirstInvalidCh |      1.18 |           250.31 |           294.48 |         |

No Faster results for the provided threshold = 0.0001% and noise filter = 0.3ns.

If a test didn't show up in the table, it's because there was no difference above the specified threshold.

@kunalspathak @pgovind @eiriktsarpalis @tannergooding @jeffhandley

tannergooding added a commit that referenced this pull request Jul 22, 2020
…#39738)

* AdvSimd support for System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar (#39050)

* AdvSimd support for System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar

* Move using directive outside #if.
Improve Arm64MoveMask.

* Change overloads

* UIn64 in Arm64MoveMask

* Build error implicit conversion fix

* Rename method and use simpler version

* Use ShiftRightArithmetic instead of CompareEqual + And.

* Remove unnecessary comment

* Add missing shims causing Linux build to fail

* AdvSimd support for System.Text.Unicode.Utf8Utility.TranscodeToUtf8 (#39041)

* AdvSimd support for System.Text.Unicode.Utf8Utility.TranscodeToUtf8

* Readd using to prevent build failure.
Add AdvSimd equivalent operation to TestZ.

* Inverted condition

* Address IsSupported order, improve use ExtractNarrowingSaturated usage

* Rename source to result, second argument utf16Data

* Improve CompareTest

* Add shims causing failures in Linux

* Use unsigned version of ExtractNarrowingSaturate, avoid using MinAcross and use MaxPairwise instead

* Missing support check for Sse2.X64

* Add missing case for AdvSimd

* Use MinPairwise for short

* AdvSimd support for System.Text.Unicode.Utf8Utility.GetPointerToFirstInvalidByte (#38653)

* AdvSimd support for System.Text.Unicode.Utf8Utility.GetPointerToFirstInvalidByte

* Move comment to the top, add shims.

* Little endian checks

* Use custom MoveMask method for AdvSimd

* Address suggestions to improve the AdvSimdMoveMask method

* Define initialMask outside MoveMask method

* UInt64 in Arm64MoveMask

* Add unit test case to verify intrinsics improvement

* Avoid casting to smaller integer type

* Typo and comment

* Use ShiftRightArithmetic instead of CompareEqual + And.
Remove test case causing other unit tests to fail.

* Use AddPairwise version of GetNotAsciiBytes

* Add missing shims causing Linux build to fail

* Simplify GetNonAsciiBytes to only one AddPairwise call, shorter bitmask

* Respect data type returned by masking method

* Address suggestions - assert trailingzerocount and bring back uint mask

* Trailing zeroes in AdvSimd need to be divided by 4, and total number should not be larger than 16

* Avoid declaring static field which causes PNSE in Utf8String.Experimental (S.P.Corelib code is used for being NetStandard)

* Prefer using nuint for BitConverter.TrailingZeroCount

* Fix build failure in net472 debug AdvSimd Utf16Utility (#39652)

Co-authored-by: Carlos Sanchez Lopez <1175054+carlossanlop@users.noreply.github.com>
@kunalspathak
Member

> I created 3 benchmark tests that touch the code from each of my 3 PRs. You can see the code in this commit in my fork.

Thanks @carlossanlop for writing these benchmarks. A few points from looking at the benchmark code:

  1. Utf8Utility_GetPointerToFirstInvalidByte: This benchmark has non-ASCII bytes at index 0 and 1 of the input stream, so your PR code will find them in the 1st iteration and exit. You probably need to vary the positions at which non-ASCII bytes appear to know whether repeatedly calling NonAsciiBytes() (on input that is all ASCII) is beneficial or not.

  2. Utf16Utility_GetPointerToFirstInvalidChar and Utf8Utility_TranscodeToUtf8: Same comment. I am not sure the input in your benchmark will stress all the branches in your PR code. It is probably worth introducing some variation in the input.
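
The suggestion to vary where the non-ASCII bytes sit could be sketched roughly as follows. This is a minimal illustration in portable C, not the benchmark code from the fork: `make_input` and `first_non_ascii_index` are hypothetical helper names, and the scalar scanner only stands in as a reference for the position the vectorized method would have to reach.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill buf with ASCII 'a' and, if it fits, place the
   two-byte UTF-8 sequence 0xC3 0xA9 (U+00E9) starting at index pos.
   Passing pos >= len - 1 leaves the buffer all ASCII. */
void make_input(uint8_t *buf, size_t len, size_t pos) {
    assert(buf != NULL);
    memset(buf, 'a', len);
    if (pos + 2 <= len) {
        buf[pos] = 0xC3;
        buf[pos + 1] = 0xA9;
    }
}

/* Scalar reference: index of the first byte with the top bit set,
   or len when every byte is ASCII. */
size_t first_non_ascii_index(const uint8_t *buf, size_t len) {
    assert(buf != NULL);
    size_t i = 0;
    while (i < len && (buf[i] & 0x80) == 0) {
        i++;
    }
    return i;
}
```

Running the same benchmark over inputs built with `pos` at the start, in the middle, and past the end would separate the "exit in the first iteration" case from the case where the vector loop scans a long all-ASCII run.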

For SSE2, same observation as earlier: we see some improvement for a few benchmarks while the others remain stable (they don't regress). For AdvSimd, however, we don't see any benchmark getting faster, and we regress some benchmarks. With that, I am not sure intrinsifying those methods with ARM64 intrinsics is worth doing.

@TamarChristinaArm , @BruceForstall - thoughts?

@TamarChristinaArm
Contributor

@kunalspathak Part of it is that the code for Arm64 seems to be a literal translation of the SSE one. This introduces a couple of inefficiencies.

A couple of examples (without taking a look at the algorithm itself but just the intrinsics usage):

  1. The first is this call to GetNonAsciiBytes https://github.com/carlossanlop/runtime/blob/1e2e3da61eb3ad2f7e51909b5ee8d20985fd5eb1/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs#L148 which is followed by a call to popcount: uint popcnt = (uint)BitOperations.PopCount(mask);

This has several problems. GetNonAsciiBytes reduces the value and moves it from SIMD to GENREG only to do a popcount on it.
The popcount only works on the SIMD side, so the implementation

Vector64<uint> input = Vector64.CreateScalar(value);
moves it back to the SIMD side to do the popcount and then moves it back again to the GENREG side after a very expensive ADDV.

So you hit a reduction in every iteration. As far as I can tell popcnt is only used as a counter: tempUtf8CodeUnitCountAdjustment += popcnt; so you are better off counting into a vector with vadd on the vector result of popcnt and only reducing once at the end after the loop (if you don't hit an early overflow). The same goes for surrogatePairsCountNint. At the very least, you don't need to transfer this back to the GENREG side: keep the values on the SIMD side and use AddScalar during the loop.

But looking at it, the mask doesn't seem to be used for anything other than popcount. So you are essentially counting how many elements are non-ASCII and don't care about the position they are in. So can't you instead do a logical right shift of 7 on the input bytes and count those? Then you don't need the popcount nor the other instructions.

  2. The helper GetNonAsciiBytes seems to use a global for the mask, s_bitMask128. This seems declared as a global constant, which I am assuming gets put in a literal pool. The problem is (I am guessing but don't have the codegen) that every call of the inlined GetNonAsciiBytes will now do a load from memory. The literal pools are far away, so it spends 4 instructions building the address and then a load for each call, and I think this is called 2-3 times inside each iteration of the loop?

I don't believe the JIT does any hoisting, does it?

  3. The second GetNonAsciiBytes call first checks the value to be non-zero and then uses the mask. https://github.com/carlossanlop/runtime/blob/1e2e3da61eb3ad2f7e51909b5ee8d20985fd5eb1/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs#L188

The problem here is that you have done more operations than needed if you don't enter the if block, since all you care about is whether there is a byte with the MSB set. If the common case is that the if is not entered, you can instead just check with the SMAXP trick again (SVE has vector instructions that set the flags, but sadly NEON does not).
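
To make the first point concrete, here is a minimal scalar sketch in portable C that emulates 16 byte lanes with plain arrays. It is not the PR's code: `count_via_mask` and `count_via_shift` are hypothetical names, and the per-lane arrays only stand in for vector registers. The first function builds a positional bitmask per block and popcounts it on every iteration; the second shifts each byte right by 7, accumulates per lane, and reduces once after the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16

/* Count set bits in a 16-bit mask (stand-in for a popcount instruction). */
static unsigned popcount16(uint16_t m) {
    unsigned n = 0;
    while (m != 0) {
        m &= (uint16_t)(m - 1); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Approach 1: positional mask plus popcount, reduced in every block. */
size_t count_via_mask(const uint8_t *p, size_t len) {
    assert(len % LANES == 0);
    size_t total = 0;
    for (size_t i = 0; i < len; i += LANES) {
        uint16_t mask = 0;
        for (int l = 0; l < LANES; l++) {
            mask |= (uint16_t)((p[i + l] >> 7) << l);
        }
        total += popcount16(mask);
    }
    return total;
}

/* Approach 2: logical shift right by 7 gives 0 or 1 per lane; accumulate
   per lane and do a single horizontal reduction after the loop. */
size_t count_via_shift(const uint8_t *p, size_t len) {
    assert(len % LANES == 0);
    uint32_t acc[LANES] = {0};
    for (size_t i = 0; i < len; i += LANES) {
        for (int l = 0; l < LANES; l++) {
            acc[l] += (uint32_t)(p[i + l] >> 7);
        }
    }
    size_t total = 0;
    for (int l = 0; l < LANES; l++) {
        total += acc[l];
    }
    return total;
}
```

Both functions return the same count; the second never needs the positions, which is the observation above. The accumulators are deliberately 32 bits wide in this sketch so lane overflow is not a concern here.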

@TamarChristinaArm
Contributor

I haven't looked at the UTF8 implementation but I assume it's roughly the same.

@kunalspathak
Member

Thanks for the great feedback @TamarChristinaArm !

  1. Regarding PopCount

That's a good observation. I didn't realize that SIMD -> GENREG -> SIMD is happening here.

> So can't you instead do a logical right shift of 7 on the input bytes and count those? Then you don't need the popcount nor the other instructions.

Part of "logical right shift of 7" is done inside GetNonAsciiBytes() but my understanding is that we do need the mask further to check for surrogate. It could be that the need is just because we are using a mask that was written for SSE2 and can be updated to take advantage of what a simplified GetNonAsciiBytes() would return.

> The helper GetNonAsciiBytes seems to use a global for the mask, s_bitMask128.

The latest code doesn't use a static variable. That's not because the JIT doesn't optimize it, but because of other failures we see in code that consumes the file for .netstandard. The JIT does optimize readonly static fields. Related conversation about this: #38653 (comment)

I think overall that's good feedback that @carlossanlop should consider looking into to eliminate the regressions.

@TamarChristinaArm
Contributor

> Part of "logical right shift of 7" is done inside GetNonAsciiBytes() but my understanding is that we do need the mask further to check for surrogate. It could be that the need is just because we are using a mask that was written for SSE2 and can be updated to take advantage from what simplified GetNonAsciiBytes() would return.

@kunalspathak well it does an arithmetic right shift so the values in each lane are either 0x00 or 0xff so it can mask out the indices. But what I was getting at is that by only using it as a popcount value you no longer really care about the positions of the non-ascii values but only the number of them. So you don't need the mask etc for that case, just a logical right shift which would leave you with lanes either 0x00 or 0x1.

So what I mean is, using the helper function makes 2 of the 3 cases it's being used do more work than they have to. Only 1 case actually uses the positional information.
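
The difference between the two shifts can be shown with a small scalar emulation of 16 byte lanes in portable C. The function names are hypothetical, and the arithmetic shift is emulated by testing the top bit, because `>>` on a negative signed value is implementation-defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16

/* Arithmetic shift right by 7 per lane: 0x00 for ASCII, 0xFF otherwise.
   Useful as an all-ones/all-zeros lane mask when positions matter. */
void lanes_asr7(const uint8_t in[LANES], uint8_t out[LANES]) {
    assert(in != NULL && out != NULL);
    for (int l = 0; l < LANES; l++) {
        out[l] = (in[l] & 0x80) ? 0xFF : 0x00;
    }
}

/* Logical shift right by 7 per lane: 0x00 for ASCII, 0x01 otherwise.
   The lanes can be summed directly to count non-ASCII bytes. */
void lanes_lsr7(const uint8_t in[LANES], uint8_t out[LANES]) {
    assert(in != NULL && out != NULL);
    for (int l = 0; l < LANES; l++) {
        out[l] = (uint8_t)(in[l] >> 7);
    }
}
```

The 0x00/0xFF form is what a later AND with a bit mask needs; the 0x00/0x01 form is enough when only the number of non-ASCII lanes is wanted.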

@kunalspathak
Member

> @kunalspathak well it does an arithmetic right shift so the values in each lane are either 0x00 or 0xff so it can mask out the indices. But what I was getting at is that by only using it as a popcount value you no longer really care about the positions of the non-ascii values but only the number of them. So you don't need the mask etc for that case, just a logical right shift which would leave you with lanes either 0x00 or 0x1.

I see what you mean. Thanks for the clarification. So with your earlier suggestion, we can just use AddScalar on the vectors that have 0x00 and 0x1 to accumulate how many lanes were non-ASCII and then, outside the loop, use AddAcross to get the overall count of non-ASCII bytes we have seen in the entire input and set that in tempUtf8CodeUnitCountAdjustment.

> So what I mean is, using the helper function makes 2 of the 3 cases it's being used do more work than they have to. Only 1 case actually uses the positional information.

Yes, I also now see your argument about the case where we just check if (mask != 0): we can optimize the case when mask == 0 by using the SMAXP trick.

@TamarChristinaArm
Contributor

> I see what you mean. Thanks for the clarification. So with your earlier suggestion, we can just use AddScalar on the vectors that have 0x00 and 0x1 to accumulate how many lanes were non-ascii and then, outside the loop, use AddAcross to essentially get the overall count of non-ascii bytes we have seen in the entire input and set that in tempUtf8CodeUnitCountAdjustment.

Yup, you probably want to reduce the value so that you get a bigger range. If you do the add on the value after the shift, then it's a byte and you'd overflow after 255 non-ascii characters in the same position. You can use ADDP to reduce it to two longs or 4 ints and use vadd on those during the loop.
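
The overflow concern can be illustrated with a scalar emulation in portable C. This is only a sketch under the assumption of 16 byte lanes per block; `count_narrow` and `count_widened` are hypothetical names. The widened variant folds each group of four adjacent 0/1 byte lanes into one of four 32-bit lanes before accumulating, loosely mirroring the "reduce to 4 ints, then vadd" idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16

/* Naive: accumulate 0/1 flags in 8-bit lanes. A lane wraps around once it
   has seen more than 255 non-ASCII bytes in the same position. */
uint32_t count_narrow(const uint8_t *p, size_t len) {
    assert(len % LANES == 0);
    uint8_t acc[LANES] = {0};
    for (size_t i = 0; i < len; i += LANES) {
        for (int l = 0; l < LANES; l++) {
            acc[l] = (uint8_t)(acc[l] + (p[i + l] >> 7));
        }
    }
    uint32_t total = 0;
    for (int l = 0; l < LANES; l++) {
        total += acc[l];
    }
    return total;
}

/* Widened: reduce each block's 16 flags into four 32-bit lanes first,
   then accumulate, so the counters have far more headroom. */
uint32_t count_widened(const uint8_t *p, size_t len) {
    assert(len % LANES == 0);
    uint32_t acc[4] = {0};
    for (size_t i = 0; i < len; i += LANES) {
        for (int l = 0; l < LANES; l++) {
            acc[l / 4] += (uint32_t)(p[i + l] >> 7);
        }
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

With 256 blocks that each carry one non-ASCII byte in lane 0, the narrow accumulator wraps back to zero while the widened one reports 256.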

Now on architectures that support it, you have a quick way to go from byte to int, which is using the dot-product instructions.

USDOT with the last vector being a vector of ones will get you a widening sum from byte to int, which saves you from needing two addp. But that's an additional enhancement, just FYI.

@kunalspathak
Member

kunalspathak commented Aug 4, 2020

@carlossanlop - any update on the perf regression after intrinsifying the methods?

FYI - @JulieLeeMSFT , @jeffhandley

@TamarChristinaArm
Contributor

@kunalspathak @carlossanlop Which reminds me: Wilco (who does the work on Arm's optimized string routines) mentioned that this helper:

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static uint GetNonAsciiBytes(Vector128<byte> value, Vector128<byte> bitMask128)
        {
            Debug.Assert(AdvSimd.Arm64.IsSupported);

            Vector128<byte> mostSignificantBitIsSet = AdvSimd.ShiftRightArithmetic(value.AsSByte(), 7).AsByte();
            Vector128<byte> extractedBits = AdvSimd.And(mostSignificantBitIsSet, bitMask128);

            // self-pairwise add until all flags have moved to the first two bytes of the vector
            extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
            extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
            extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
            return extractedBits.AsUInt16().ToScalar();
        }

can be made much more efficient by using a 0xF00F mask instead of the 0x80402010_08040201 currently used (see https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memrchr.S#L55).

This mask has several advantages: it can be created without needing a literal pool, it's endian agnostic, and it requires only a single addp to compress the data.

It works by essentially doing

uint16_t elem = 0xF00F;
uint16x8_t mask = vdupq_n_u16 (elem);
uint8x16_t vmask = vreinterpretq_u8_u16 (mask);

You then AND that mask with the result of the arithmetic right shift.

The mask gives you alternating sequences of 0xF0, 0x0F, 0xF0, 0x0F, ... and essentially leaves a slot open for the addp.

After a single addp, each lane holds 0x00, 0xFF, 0xF0 or 0x0F, which carries the same information you would have gotten with the current mask after 3 ADDPs, except that each bit of the original mask now occupies 4 bits in the final one.
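
The nibble-mask idea can be emulated in portable scalar C to see what the 64-bit result looks like. This is a sketch, not the NEON or C# implementation: `nibble_mask` and `first_non_ascii_lane` are hypothetical names, the arithmetic shift is emulated with a top-bit test, and the sketch assumes the lane order used in the linked routine, where even lanes are ANDed with 0x0F and odd lanes with 0xF0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates: asr #7 per byte, AND with the repeated 0xF00F mask, then one
   pairwise add of adjacent bytes. Each input byte ends up owning one
   nibble of the 64-bit result (0xF if its top bit was set, else 0x0). */
uint64_t nibble_mask(const uint8_t block[16]) {
    assert(block != NULL);
    uint64_t synd = 0;
    for (int k = 0; k < 8; k++) {
        uint8_t even = (block[2 * k] & 0x80) ? 0xFF : 0x00;
        uint8_t odd = (block[2 * k + 1] & 0x80) ? 0xFF : 0x00;
        /* The two masked values never share bits, so the add cannot carry. */
        uint8_t pair = (uint8_t)((even & 0x0F) + (odd & 0xF0));
        synd |= (uint64_t)pair << (8 * k);
    }
    return synd;
}

/* Index of the first non-ASCII byte in the block, or -1 if none.
   The trailing zero count is divided by 4 because each lane owns 4 bits. */
int first_non_ascii_lane(const uint8_t block[16]) {
    uint64_t synd = nibble_mask(block);
    if (synd == 0) {
        return -1;
    }
    int tz = 0;
    while (((synd >> tz) & 1u) == 0) {
        tz++;
    }
    return tz / 4;
}
```

A byte with its top bit set at index 5 produces `0xF00000`, so the trailing zero count is 20 and the lane index is 20 / 4 = 5.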

So I think it's best to separate the Arm64 one from the SSE one since a different approach should yield much better results.

@kunalspathak
Member

kunalspathak commented Aug 7, 2020

> I created 3 benchmark tests that touch the code from each of my 3 PRs. You can see the code in this commit in my fork.

@carlossanlop - It will be good to submit a PR (after you address the feedback I have given here) to dotnet/performance repo so we can track its progress and compare it against .NET 3.1.

@carlossanlop
Member Author

carlossanlop commented Aug 19, 2020

@kunalspathak here's a followup:

I found that the Utf8Span.ToCharArray method, using the appropriate string, touches the code I introduced in my 3 ARM PRs (this PR, #39041 and #38653).

@pgovind wrote this benchmark class in which he tests this method against several large files with different encodings.

I decided to test the performance of my code, making sure to revert all of @pgovind 's ARM changes so there was no overlap/noise (his ARM improvements are closely related to mine).

There were no slower tests, but not the same number of improvements as in Sse2.

I also made sure to run all the rest of the benchmark methods in that class.

These are the results:

ToCharArray

Sse2 Off vs Sse2 On (x64 PC)
```
❯ dotnet run -c release --base C:\users\carlos\Desktop\sse_off --diff C:\users\carlos\Desktop\sse_on --threshold 0.001%
summary:
better: 15, geomean: 1.265
total diff: 15

No Slower results for the provided threshold = 0.001% and noise filter = 0.3ns.

| Faster                                                              | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Text.Perf_Utf8String.ToCharArray(Input: EnglishAllAscii)     |      1.68 |         92597.23 |         54985.47 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: EnglishMostlyAscii)  |      1.21 |        184600.81 |        152344.54 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: Cyrillic)            |      1.09 |        196266.91 |        180002.95 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: Greek)               |      1.06 |        303606.07 |        286715.45 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: Chinese)             |      1.05 |        228964.39 |        218493.01 |         |
```
AdvSimd off vs AdvSimd on (Surface Pro X)
```
root@calopearm:/home/calope/performance/src/tools/ResultsComparer# dn run -c release --base /home/calope/advsimd_off/ --diff /home/calope/advsimd_on/ --threshold 0.001%
summary:
better: 8, geomean: 1.578
total diff: 8

No Slower results for the provided threshold = 0.001% and noise filter = 0.3ns.

| Faster                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Text.Perf_Utf8String.ToCharArray(Input: Chinese)          |      1.65 |        870580.95 |        528188.89 | several?|
| System.Text.Perf_Utf8String.ToCharArray(Input: EnglishAllAscii)  |      1.63 |        353382.15 |        216978.86 | several?|
```

All benchmarks

Sse2 Off vs Sse2 On (x64 PC)
```
❯ dotnet run -c release --base C:\users\carlos\Desktop\sse_off --diff C:\users\carlos\Desktop\sse_on --threshold 0.001%
summary:
better: 15, geomean: 1.265
total diff: 15

No Slower results for the provided threshold = 0.001% and noise filter = 0.3ns.

| Faster                                                              | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ------------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Text.Perf_Utf8String.ToChars(Input: EnglishAllAscii)         |      4.51 |         47471.62 |         10518.33 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: EnglishAllAscii)     |      1.68 |         92597.23 |         54985.47 |         |
| System.Text.Perf_Utf8String.IsNormalized(Input: EnglishAllAscii)    |      1.49 |        190083.06 |        127896.98 |         |
| System.Text.Perf_Utf8String.ToChars(Input: EnglishMostlyAscii)      |      1.31 |         88954.94 |         67716.30 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: EnglishMostlyAscii)  |      1.21 |        184600.81 |        152344.54 |         |
| System.Text.Perf_Utf8String.ToChars(Input: Chinese)                 |      1.12 |        142434.77 |        127030.44 |         |
| System.Text.Perf_Utf8String.ToChars(Input: Cyrillic)                |      1.11 |        103290.91 |         93199.39 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: Cyrillic)            |      1.09 |        196266.91 |        180002.95 |         |
| System.Text.Perf_Utf8String.IsNormalized(Input: EnglishMostlyAscii) |      1.08 |        430083.53 |        399830.78 |         |
| System.Text.Perf_Utf8String.ToChars(Input: Greek)                   |      1.07 |        161114.76 |        149981.74 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: Greek)               |      1.06 |        303606.07 |        286715.45 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: Chinese)             |      1.05 |        228964.39 |        218493.01 |         |
| System.Text.Perf_Utf8String.IsNormalized(Input: Cyrillic)           |      1.04 |        380897.69 |        367506.40 |         |
| System.Text.Perf_Utf8String.IsNormalized(Input: Chinese)            |      1.03 |        446214.11 |        435219.18 |         |
| System.Text.Perf_Utf8String.IsNormalized(Input: Greek)              |      1.02 |        561548.33 |        550692.19 |         |
```
AdvSimd off vs AdvSimd on (Surface Pro X)
```
root@calopearm:/home/calope/performance/src/tools/ResultsComparer# dn run -c release --base /home/calope/advsimd_off/ --diff /home/calope/advsimd_on/ --threshold 0.001%
summary:
better: 8, geomean: 1.578
total diff: 8

No Slower results for the provided threshold = 0.001% and noise filter = 0.3ns.

| Faster                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------- | ---------:| ----------------:| ----------------:| -------- |
| System.Text.Perf_Utf8String.ToChars(Input: EnglishAllAscii)      |      3.86 |        210886.70 |         54637.39 |         |
| System.Text.Perf_Utf8String.ToCharArray(Input: Chinese)          |      1.65 |        870580.95 |        528188.89 | several?|
| System.Text.Perf_Utf8String.ToCharArray(Input: EnglishAllAscii)  |      1.63 |        353382.15 |        216978.86 | several?|
| System.Text.Perf_Utf8String.IsNormalized(Input: EnglishAllAscii) |      1.62 |        819880.59 |        506121.03 |         |
| System.Text.Perf_Utf8String.ToChars(Input: Chinese)              |      1.35 |        432706.83 |        320995.57 |         |
| System.Text.Perf_Utf8String.IsAscii(Input: EnglishAllAscii)      |      1.23 |         37475.89 |         30571.02 | bimodal |
| System.Text.Perf_Utf8String.IsAscii(Input: Greek)                |      1.20 |           173.74 |           144.50 |         |
| System.Text.Perf_Utf8String.IsAscii(Input: Chinese)              |      1.15 |           140.35 |           121.52 | several?|
```
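
As a sanity check on how to read these tables, the `base/diff` column and the summary `geomean` can be recomputed from the median columns. The sketch below assumes that `base/diff` is the base median divided by the diff median and that the summary is the geometric mean of those ratios; that is inferred from the numbers above, not taken from the ResultsComparer source. The row labels are shortened and the helper names `speedup` and `geomean` are hypothetical.

```python
from math import exp, log

# (benchmark, base median ns, diff median ns), copied from the
# "AdvSimd off vs AdvSimd on (Surface Pro X)" table above.
rows = [
    ("ToChars(EnglishAllAscii)", 210886.70, 54637.39),
    ("ToCharArray(Chinese)", 870580.95, 528188.89),
    ("ToCharArray(EnglishAllAscii)", 353382.15, 216978.86),
    ("IsNormalized(EnglishAllAscii)", 819880.59, 506121.03),
    ("ToChars(Chinese)", 432706.83, 320995.57),
    ("IsAscii(EnglishAllAscii)", 37475.89, 30571.02),
    ("IsAscii(Greek)", 173.74, 144.50),
    ("IsAscii(Chinese)", 140.35, 121.52),
]


def speedup(base_ns: float, diff_ns: float) -> float:
    """Assumed meaning of the base/diff column: values above 1 mean diff is faster."""
    return base_ns / diff_ns


def geomean(values: list[float]) -> float:
    """Geometric mean, computed in log space to avoid overflow on long lists."""
    return exp(sum(log(v) for v in values) / len(values))


ratios = [speedup(base, diff) for _, base, diff in rows]
overall = geomean(ratios)

for (name, _, _), ratio in zip(rows, ratios):
    print(f"{name:32s} {ratio:5.2f}")
print(f"geomean: {overall:.3f}")
```

Under those assumptions the recomputed ratios agree with the `base/diff` column to two decimals, and the geometric mean lands close to the reported 1.578.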

Some more details on how the new code is reached:

The EnglishMostlyAscii.txt input hits these breakpoints:

  • Utf16Utility.Validation.cs: line 82 via the Utf8String() constructor
  • Utf8Utility.Transcoding.cs: lines 886, 944, 1035 via the Utf8String() constructor
  • Utf8Utility.Validation.cs: line 135 via the call to Utf8Span.ToCharArray()

The Greek.txt input hits a very similar set:

  • Utf16Utility.Validation.cs: lines 82, 150 via the Utf8String() constructor
  • Utf8Utility.Transcoding.cs: lines 886, 944 via the Utf8String() constructor
  • Utf8Utility.Validation.cs: line 135 via the call to Utf8Span.ToCharArray()

@adamsitnik
Member

@carlossanlop Thanks for sharing the full results! The improvements are impressive!

@kunalspathak
Member

kunalspathak commented Aug 19, 2020

Thanks @carlossanlop for gathering the numbers. As we discussed offline, the benchmark code touches the methods you optimized in #39041 and #38653. Utf16Utility.GetPointerToFirstInvalidChar() is only called through new Utf8String(), which happens inside Setup() and not in the benchmark code itself. Perhaps add another benchmark that calls new UTF8Encoding().GetByteCount(); it will hit Utf16Utility.GetPointerToFirstInvalidChar().

Update: The numbers look great and are expected from the work in #39041 and #38653. It will be interesting to see the benchmark numbers for this PR, because this is the one that has a longer implementation compared to GetNonAsciiBytes().

@carlossanlop
Member Author

Please take a look at this PR in the performance repo, where I added a new benchmark to test Utf16Utility.GetPointerToFirstInvalidChar(). The results are posted there: dotnet/performance#1466 (comment)

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Labels
arch-arm64 area-System.Runtime.Intrinsics utf8-impact Potentially impacts UTF-8 support in the runtime