Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove BMI2 intrinsics from ASCII and UTF-8 transcoding logic #31904

Merged
merged 4 commits into from Feb 11, 2020

Conversation

@GrabYourPitchforks
Copy link
Member

GrabYourPitchforks commented Feb 7, 2020

Contributes to #2251. For the most part, BMI2 instructions have been replaced with SIMD instructions where possible. Where the replacement is not easily doable the BMI2 instruction was removed entirely. Performance chart follows.

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Core i7-6700 CPU 3.40GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-alpha1-015755
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.19.56303, CoreFX 5.0.19.56306), X64 RyuJIT
  Job-RRFWFP : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
  Job-HRZKDM : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT
Method Toolchain Corpus Mean Error StdDev Ratio RatioSD
TranscodeUtf16ToUtf8 master 11-0.txt 96.63 us 1.836 us 1.718 us 1.00 0.00
TranscodeUtf16ToUtf8 proto 11-0.txt 93.03 us 1.246 us 1.104 us 0.96 0.03
TranscodeUtf8ToUtf16 master 11-0.txt 101.08 us 0.756 us 0.632 us 1.00 0.00
TranscodeUtf8ToUtf16 proto 11-0.txt 95.03 us 1.541 us 1.442 us 0.94 0.02
TranscodeUtf16ToUtf8 master 11.txt 13.22 us 0.092 us 0.086 us 1.00 0.00
TranscodeUtf16ToUtf8 proto 11.txt 12.99 us 0.240 us 0.224 us 0.98 0.02
TranscodeUtf8ToUtf16 master 11.txt 14.01 us 0.197 us 0.185 us 1.00 0.00
TranscodeUtf8ToUtf16 proto 11.txt 14.17 us 0.148 us 0.139 us 1.01 0.01
TranscodeUtf16ToUtf8 master 25249-0.txt 127.28 us 1.353 us 1.199 us 1.00 0.00
TranscodeUtf16ToUtf8 proto 25249-0.txt 136.48 us 1.825 us 1.707 us 1.07 0.02
TranscodeUtf8ToUtf16 master 25249-0.txt 172.26 us 1.118 us 0.873 us 1.00 0.00
TranscodeUtf8ToUtf16 proto 25249-0.txt 171.87 us 3.413 us 5.607 us 1.01 0.04
TranscodeUtf16ToUtf8 master 30774-0.txt 123.48 us 0.711 us 0.630 us 1.00 0.00
TranscodeUtf16ToUtf8 proto 30774-0.txt 124.58 us 2.422 us 2.487 us 1.01 0.02
TranscodeUtf8ToUtf16 master 30774-0.txt 125.91 us 1.965 us 1.838 us 1.00 0.00
TranscodeUtf8ToUtf16 proto 30774-0.txt 124.47 us 2.350 us 2.198 us 0.99 0.02
TranscodeUtf16ToUtf8 master 39251-0.txt 200.99 us 3.825 us 4.252 us 1.00 0.00
TranscodeUtf16ToUtf8 proto 39251-0.txt 200.85 us 3.872 us 3.622 us 1.00 0.02
TranscodeUtf8ToUtf16 master 39251-0.txt 202.09 us 1.374 us 1.285 us 1.00 0.00
TranscodeUtf8ToUtf16 proto 39251-0.txt 205.34 us 2.193 us 2.051 us 1.02 0.01

Most of the performance characteristics are approximately the same. There's a small perf gain in the Alice in Wonderland text (11-0.txt) for transcoding each direction. There's a small perf loss in the CJK sample (25249-0.txt).

Marked as no-merge because I haven't finished fuzzing it yet. It'll take a few days for the run to finish.

Note to reviewers: hiding whitespace changes will make this easier to review.

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Feb 7, 2020

I'll run against both of my AMD boxes (1800X and 3900X) later today and will share back results.

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Feb 8, 2020

Numbers for the Ryzen 9 3900X

Baseline, BMI Enabled

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 7 1800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-preview.1.20106.4
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

COMPlus_BMIEnabled=1

|               Method |      Corpus |         Mean |      Error |     StdDev |
|--------------------- |------------ |-------------:|-----------:|-----------:|
| TranscodeUtf16ToUtf8 |    11-0.txt | 1,447.621 us |  6.5875 us |  5.8397 us |
| TranscodeUtf8ToUtf16 |    11-0.txt | 1,558.368 us |  9.2168 us |  8.6214 us |
| TranscodeUtf16ToUtf8 |      11.txt |     7.884 us |  0.1195 us |  0.1118 us |
| TranscodeUtf8ToUtf16 |      11.txt |     9.009 us |  0.1282 us |  0.1136 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt |   259.336 us |  4.4757 us |  4.1866 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt | 1,255.214 us | 16.1676 us | 15.1232 us |
| TranscodeUtf16ToUtf8 | 30774-0.txt |   273.833 us |  2.7463 us |  2.5689 us |
| TranscodeUtf8ToUtf16 | 30774-0.txt |   291.992 us |  1.3660 us |  1.2777 us |
| TranscodeUtf16ToUtf8 | 39251-0.txt |   325.948 us |  4.5692 us |  4.2740 us |
| TranscodeUtf8ToUtf16 | 39251-0.txt |   341.888 us |  3.3136 us |  3.0996 us |

Baseline, BMI Disabled

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 7 1800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-preview.1.20106.4
  [Host]     : .NET Core 3.0.2 (CoreCLR 4.700.19.57202, CoreFX 4.700.19.57708), X64 RyuJIT
  DefaultJob : .NET Core 3.0.2 (CoreCLR 4.700.19.57202, CoreFX 4.700.19.57708), X64 RyuJIT

COMPlus_BMIEnabled=0

|               Method |      Corpus |       Mean |     Error |    StdDev |
|--------------------- |------------ |-----------:|----------:|----------:|
| TranscodeUtf16ToUtf8 |    11-0.txt |  80.301 us | 0.6198 us | 0.5176 us |
| TranscodeUtf8ToUtf16 |    11-0.txt |  81.718 us | 0.1675 us | 0.1485 us |
| TranscodeUtf16ToUtf8 |      11.txt |   8.651 us | 0.0871 us | 0.0772 us |
| TranscodeUtf8ToUtf16 |      11.txt |   8.888 us | 0.0444 us | 0.0416 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt |  87.819 us | 0.4431 us | 0.4145 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt | 111.830 us | 1.0943 us | 1.0236 us |
| TranscodeUtf16ToUtf8 | 30774-0.txt |  77.471 us | 0.5238 us | 0.4900 us |
| TranscodeUtf8ToUtf16 | 30774-0.txt |  81.581 us | 0.8401 us | 0.7858 us |
| TranscodeUtf16ToUtf8 | 39251-0.txt | 127.147 us | 0.7287 us | 0.6816 us |
| TranscodeUtf8ToUtf16 | 39251-0.txt | 128.445 us | 0.7691 us | 0.7194 us |

This PR

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 7 1800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.100-preview.1.20106.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.10310, CoreFX 5.0.20.10310), X64 RyuJIT
  Job-ENPTWT : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

Toolchain=proto

|               Method |      Corpus |       Mean |     Error |    StdDev |
|--------------------- |------------ |-----------:|----------:|----------:|
| TranscodeUtf16ToUtf8 |    11-0.txt |  63.033 us | 0.2028 us | 0.1897 us |
| TranscodeUtf8ToUtf16 |    11-0.txt |  65.720 us | 0.1721 us | 0.1525 us |
| TranscodeUtf16ToUtf8 |      11.txt |   7.004 us | 0.0287 us | 0.0254 us |
| TranscodeUtf8ToUtf16 |      11.txt |   8.661 us | 0.0428 us | 0.0400 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt |  85.969 us | 0.6275 us | 0.5870 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt | 111.889 us | 0.3453 us | 0.3230 us |
| TranscodeUtf16ToUtf8 | 30774-0.txt |  79.330 us | 0.6134 us | 0.5737 us |
| TranscodeUtf8ToUtf16 | 30774-0.txt |  82.454 us | 0.7204 us | 0.6739 us |
| TranscodeUtf16ToUtf8 | 39251-0.txt | 123.258 us | 1.3996 us | 1.3092 us |
| TranscodeUtf8ToUtf16 | 39251-0.txt | 129.819 us | 2.4805 us | 2.6541 us |
@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Feb 8, 2020

I'll run 3900X results tonight and post, as that box my personal one at home.

@GrabYourPitchforks

This comment has been minimized.

Copy link
Member Author

GrabYourPitchforks commented Feb 8, 2020

The fuzzing run is now in progress. The fuzzer is running over the combination of this PR with #32036 cherry-picked.

For reference, the fuzzer target application is https://github.com/GrabYourPitchforks/utf8fuzz. There are two runs going in parallel: one with full intrinsic enabled, and one with SSE4.1+ disabled.

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Feb 10, 2020

Numbers for the Ryzen 9 3900X

Baseline, BMI Enabled

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.1.20110.2
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

COMPlus_BMIEnabled=1

|               Method |      Corpus |         Mean |     Error |    StdDev |
|--------------------- |------------ |-------------:|----------:|----------:|
| TranscodeUtf16ToUtf8 |    11-0.txt | 1,234.840 us | 5.0549 us | 4.7284 us |
| TranscodeUtf8ToUtf16 |    11-0.txt | 1,310.632 us | 6.8465 us | 6.4042 us |
| TranscodeUtf16ToUtf8 |      11.txt |     6.361 us | 0.0245 us | 0.0229 us |
| TranscodeUtf8ToUtf16 |      11.txt |     7.603 us | 0.0356 us | 0.0333 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt |   213.272 us | 0.7605 us | 0.6742 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt | 1,040.893 us | 2.7581 us | 2.5799 us |
| TranscodeUtf16ToUtf8 | 30774-0.txt |   237.385 us | 0.6961 us | 0.6512 us |
| TranscodeUtf8ToUtf16 | 30774-0.txt |   245.502 us | 1.3296 us | 1.2437 us |
| TranscodeUtf16ToUtf8 | 39251-0.txt |   285.015 us | 1.9253 us | 1.8009 us |
| TranscodeUtf8ToUtf16 | 39251-0.txt |   294.518 us | 1.4636 us | 1.2974 us |

Baseline, BMI Disabled

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.1.20110.2
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  DefaultJob : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT

COMPlus_BMIEnabled=0

|               Method |      Corpus |       Mean |     Error |    StdDev |
|--------------------- |------------ |-----------:|----------:|----------:|
| TranscodeUtf16ToUtf8 |    11-0.txt |  64.385 us | 0.3538 us | 0.3309 us |
| TranscodeUtf8ToUtf16 |    11-0.txt |  67.793 us | 1.0771 us | 0.9548 us |
| TranscodeUtf16ToUtf8 |      11.txt |   6.406 us | 0.0216 us | 0.0202 us |
| TranscodeUtf8ToUtf16 |      11.txt |   7.467 us | 0.0506 us | 0.0448 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt |  66.358 us | 0.6297 us | 0.5890 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt |  90.848 us | 0.4346 us | 0.4066 us |
| TranscodeUtf16ToUtf8 | 30774-0.txt |  71.887 us | 1.0717 us | 1.0025 us |
| TranscodeUtf8ToUtf16 | 30774-0.txt |  71.582 us | 1.3246 us | 1.2390 us |
| TranscodeUtf16ToUtf8 | 39251-0.txt | 115.038 us | 0.8853 us | 0.8281 us |
| TranscodeUtf8ToUtf16 | 39251-0.txt | 121.819 us | 0.7942 us | 0.7040 us |

This PR

BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET Core SDK=5.0.100-preview.1.20110.2
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.10310, CoreFX 5.0.20.10310), X64 RyuJIT
  Job-WZDGVL : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), X64 RyuJIT

Toolchain=proto

|               Method |      Corpus |       Mean |     Error |    StdDev |
|--------------------- |------------ |-----------:|----------:|----------:|
| TranscodeUtf16ToUtf8 |    11-0.txt |  55.192 us | 0.5790 us | 0.5416 us |
| TranscodeUtf8ToUtf16 |    11-0.txt |  55.675 us | 0.6835 us | 0.6393 us |
| TranscodeUtf16ToUtf8 |      11.txt |   5.391 us | 0.0325 us | 0.0304 us |
| TranscodeUtf8ToUtf16 |      11.txt |   7.521 us | 0.0231 us | 0.0204 us |
| TranscodeUtf16ToUtf8 | 25249-0.txt |  65.557 us | 0.3800 us | 0.3554 us |
| TranscodeUtf8ToUtf16 | 25249-0.txt |  90.860 us | 0.4044 us | 0.3782 us |
| TranscodeUtf16ToUtf8 | 30774-0.txt |  68.609 us | 0.8211 us | 0.7681 us |
| TranscodeUtf8ToUtf16 | 30774-0.txt |  74.238 us | 0.8129 us | 0.7604 us |
| TranscodeUtf16ToUtf8 | 39251-0.txt | 114.686 us | 0.8000 us | 0.7483 us |
| TranscodeUtf8ToUtf16 | 39251-0.txt | 119.215 us | 0.8738 us | 0.8174 us |
@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Feb 10, 2020

Also updated the 1800X numbers above to include Baseline, BMI Disabled

@GrabYourPitchforks

This comment has been minimized.

Copy link
Member Author

GrabYourPitchforks commented Feb 10, 2020

Removed no-merge label as fuzzing run has been going all weekend with no problem.

Copy link
Member

tannergooding left a comment

Overall LGTM.

I don't think we need the assert that we are little endian on x86 and the intrinsic code could likely benefit from comments in a few places for abitrary users reading the code.

@GrabYourPitchforks GrabYourPitchforks merged commit dbafde8 into dotnet:master Feb 11, 2020
33 of 34 checks passed
33 of 34 checks passed
runtime Build #20200211.103 had test failures
Details
WIP Ready for review
Details
license/cla All CLA requirements met.
Details
runtime (Checkout) Checkout succeeded
Details
runtime (CoreCLR Product Build Linux arm release) CoreCLR Product Build Linux arm release succeeded
Details
runtime (CoreCLR Product Build Linux arm64 release) CoreCLR Product Build Linux arm64 release succeeded
Details
runtime (CoreCLR Product Build Linux x64 release) CoreCLR Product Build Linux x64 release succeeded
Details
runtime (CoreCLR Product Build Linux_musl arm64 release) CoreCLR Product Build Linux_musl arm64 release succeeded
Details
runtime (CoreCLR Product Build Linux_musl x64 release) CoreCLR Product Build Linux_musl x64 release succeeded
Details
runtime (CoreCLR Product Build OSX x64 checked) CoreCLR Product Build OSX x64 checked succeeded
Details
runtime (CoreCLR Product Build OSX x64 release) CoreCLR Product Build OSX x64 release succeeded
Details
runtime (CoreCLR Product Build Windows_NT arm release) CoreCLR Product Build Windows_NT arm release succeeded
Details
runtime (CoreCLR Product Build Windows_NT arm64 release) CoreCLR Product Build Windows_NT arm64 release succeeded
Details
runtime (CoreCLR Product Build Windows_NT x64 release) CoreCLR Product Build Windows_NT x64 release succeeded
Details
runtime (CoreCLR Product Build Windows_NT x86 release) CoreCLR Product Build Windows_NT x86 release succeeded
Details
runtime (Libraries Build Windows_NT net472 x86 Release) Libraries Build Windows_NT net472 x86 Release succeeded
Details
runtime (Mono LLVM Product Build Linux x64 debug) Mono LLVM Product Build Linux x64 debug succeeded
Details
runtime (Mono LLVM Product Build Linux x64 release) Mono LLVM Product Build Linux x64 release succeeded
Details
runtime (Mono LLVM Product Build OSX x64 debug) Mono LLVM Product Build OSX x64 debug succeeded
Details
runtime (Mono LLVM Product Build OSX x64 release) Mono LLVM Product Build OSX x64 release succeeded
Details
runtime (Mono Product Build Linux x64 debug) Mono Product Build Linux x64 debug succeeded
Details
runtime (Mono Product Build Linux x64 release) Mono Product Build Linux x64 release succeeded
Details
runtime (Mono Product Build Linux_musl x64 debug) Mono Product Build Linux_musl x64 debug succeeded
Details
runtime (Mono Product Build Linux_musl x64 release) Mono Product Build Linux_musl x64 release succeeded
Details
runtime (Mono Product Build OSX x64 debug) Mono Product Build OSX x64 debug succeeded
Details
runtime (Mono Product Build OSX x64 release) Mono Product Build OSX x64 release succeeded
Details
runtime (Mono Product Build Windows_NT x64 debug) Mono Product Build Windows_NT x64 debug succeeded
Details
runtime (Mono Product Build Windows_NT x64 release) Mono Product Build Windows_NT x64 release succeeded
Details
runtime-live-build Build #20200211.90 succeeded
Details
runtime-live-build (Build Linux x64 debug RuntimeFlavor_Mono) Build Linux x64 debug RuntimeFlavor_Mono succeeded
Details
runtime-live-build (Build Linux x64 debug Runtime_Release) Build Linux x64 debug Runtime_Release succeeded
Details
runtime-live-build (Build OSX x64 release Runtime_Debug) Build OSX x64 release Runtime_Debug succeeded
Details
runtime-live-build (Build Windows_NT x86 release Runtime_Debug) Build Windows_NT x86 release Runtime_Debug succeeded
Details
runtime-live-build (Checkout) Checkout succeeded
Details
@GrabYourPitchforks GrabYourPitchforks deleted the GrabYourPitchforks:utf8_simd branch Feb 11, 2020
GrabYourPitchforks added a commit to GrabYourPitchforks/coreclr that referenced this pull request Feb 11, 2020
Remove BMI2 from ASCII and UTF-16 processing hot paths, as not all processors have optimized implementations of pext/pdep
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants
You can’t perform that action at this time.