Light up core ASCII.Utility methods with Vector256/Vector512 code paths. #88532

Merged
merged 9 commits into from Jul 17, 2023

@anthonycanino (Contributor) commented Jul 7, 2023

This PR lights up several code paths in ASCII.Utility with Vector256/Vector512 code, namely NarrowUtf16ToAscii, WidenAsciiToUtf16, GetIndexOfFirstNonAsciiChar, and GetIndexOfFirstNonAsciiByte.

For the GetIndexOf methods, we have implemented the simpler, existing "default" code path but with the explicit VectorXX APIs; for the Narrow/Widen methods, we have implemented the more complex SSE/Vector256 path but with the Vector256/Vector512 APIs. Right now, both are a slight tradeoff between code complexity and performance.

We are open to adjusting the implementation style for either path. Perf numbers coming soon.
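As a rough illustration of the "default" style described above, the sketch below scans a buffer for the first non-ASCII byte using the cross-platform Vector512 and Vector256 APIs. It is not the code from this PR: the class and method names are hypothetical, it works on spans rather than raw pointers, and it resolves the exact index inside the vector loop to keep the example short.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class AsciiScanSketch
{
    // Hypothetical helper (not the PR's method): returns the index of the first
    // byte >= 0x80, or data.Length when every byte is ASCII.
    public static int IndexOfFirstNonAsciiByte(ReadOnlySpan<byte> data)
    {
        ref byte start = ref MemoryMarshal.GetReference(data);
        int i = 0;

        if (Vector512.IsHardwareAccelerated)
        {
            for (; i + Vector512<byte>.Count <= data.Length; i += Vector512<byte>.Count)
            {
                // One mask bit per byte; a set bit means that byte has its high bit set.
                ulong mask = Vector512.LoadUnsafe(ref start, (nuint)i).ExtractMostSignificantBits();
                if (mask != 0)
                {
                    return i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        if (Vector256.IsHardwareAccelerated)
        {
            for (; i + Vector256<byte>.Count <= data.Length; i += Vector256<byte>.Count)
            {
                uint mask = Vector256.LoadUnsafe(ref start, (nuint)i).ExtractMostSignificantBits();
                if (mask != 0)
                {
                    return i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        // Scalar tail for whatever the vector loops did not cover.
        for (; i < data.Length; i++)
        {
            if (data[i] >= 0x80)
            {
                return i;
            }
        }

        return data.Length;
    }
}
```

The two vector loops run back to back, so on hardware with Vector512 support the Vector256 loop only handles a remainder shorter than 64 bytes.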

Perf

Please see the next section for the raw data collected from the utf-8 case of System.Text.Tests.Perf_Encoding. I have formatted it a bit here to make it easier to draw conclusions. Essentially, I compared the base build against this PR twice: once with DOTNET_EnableAVX512F=1 for the Vector512 path, and once with DOTNET_EnableAVX512F=0 for the Vector256 path.

The microbenchmark is run with:

runtime\dotnet.cmd run -c Release -f net8.0 --filter System.Text.Tests.Perf_Encoding.* --corerun runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe

I have added some additional data sizes and an additional microbenchmark, GetCharCount, which simply invokes enc.GetCharCount, analogous to GetByteCount. Speedups are highlighted in green and slowdowns in orange.

[Images: formatted result tables for GetBytes, GetChars, GetByteCount, and GetCharCount]

Raw Results from Two Runs

With DOTNET_EnableAVX512F=1

Method Job Toolchain size encName Mean Error StdDev Median Min Max Ratio RatioSD Gen0 Gen1 Allocated Alloc Ratio
GetBytes Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 16.870 ns 0.3963 ns 0.4069 ns 16.946 ns 16.094 ns 17.412 ns 1.00 0.00 0.0021 - 40 B 1.00
GetBytes Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 17.409 ns 0.3935 ns 0.3865 ns 17.524 ns 16.928 ns 18.138 ns 1.03 0.03 0.0021 - 40 B 1.00
GetChars Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 24.459 ns 0.5415 ns 0.5561 ns 24.677 ns 23.679 ns 25.508 ns 1.00 0.00 0.0029 - 56 B 1.00
GetChars Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 26.097 ns 0.4077 ns 0.3814 ns 26.229 ns 25.364 ns 26.878 ns 1.07 0.03 0.0029 - 56 B 1.00
GetByteCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 9.204 ns 0.4215 ns 0.4853 ns 9.283 ns 7.837 ns 9.642 ns 1.00 0.00 - - - NA
GetByteCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 8.027 ns 0.0503 ns 0.0392 ns 8.017 ns 7.981 ns 8.081 ns 0.88 0.07 - - - NA
GetCharCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 5.928 ns 0.0466 ns 0.0413 ns 5.924 ns 5.882 ns 5.993 ns 1.00 0.00 - - - NA
GetCharCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 7.510 ns 0.1455 ns 0.1361 ns 7.429 ns 7.360 ns 7.777 ns 1.27 0.03 - - - NA
GetBytes Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 16.694 ns 0.1554 ns 0.1213 ns 16.644 ns 16.571 ns 16.952 ns 1.00 0.00 0.0029 - 56 B 1.00
GetBytes Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 16.070 ns 0.0626 ns 0.0523 ns 16.075 ns 15.967 ns 16.149 ns 0.96 0.01 0.0029 - 56 B 1.00
GetChars Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 25.220 ns 0.1805 ns 0.1600 ns 25.204 ns 25.009 ns 25.451 ns 1.00 0.00 0.0046 - 88 B 1.00
GetChars Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 25.691 ns 0.2295 ns 0.1916 ns 25.649 ns 25.316 ns 25.995 ns 1.02 0.01 0.0046 - 88 B 1.00
GetByteCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 8.251 ns 0.0137 ns 0.0121 ns 8.254 ns 8.223 ns 8.265 ns 1.00 0.00 - - - NA
GetByteCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 7.585 ns 0.0211 ns 0.0198 ns 7.585 ns 7.552 ns 7.623 ns 0.92 0.00 - - - NA
GetCharCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 6.889 ns 0.1059 ns 0.0939 ns 6.913 ns 6.563 ns 6.924 ns 1.00 0.00 - - - NA
GetCharCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 7.681 ns 0.0036 ns 0.0030 ns 7.681 ns 7.675 ns 7.686 ns 1.12 0.02 - - - NA
GetBytes Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 30.093 ns 0.1604 ns 0.1339 ns 30.134 ns 29.827 ns 30.335 ns 1.00 0.00 0.0046 - 88 B 1.00
GetBytes Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 29.859 ns 0.2332 ns 0.2068 ns 29.885 ns 29.535 ns 30.140 ns 0.99 0.01 0.0046 - 88 B 1.00
GetChars Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 29.528 ns 0.2547 ns 0.2258 ns 29.507 ns 29.246 ns 30.069 ns 1.00 0.00 0.0081 - 152 B 1.00
GetChars Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 31.390 ns 0.2142 ns 0.1899 ns 31.411 ns 30.957 ns 31.668 ns 1.06 0.01 0.0080 - 152 B 1.00
GetByteCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 9.404 ns 0.0206 ns 0.0192 ns 9.404 ns 9.366 ns 9.435 ns 1.00 0.00 - - - NA
GetByteCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 9.203 ns 0.0503 ns 0.0471 ns 9.209 ns 9.094 ns 9.255 ns 0.98 0.01 - - - NA
GetCharCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 7.445 ns 0.0030 ns 0.0027 ns 7.446 ns 7.440 ns 7.448 ns 1.00 0.00 - - - NA
GetCharCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 8.367 ns 0.0072 ns 0.0064 ns 8.367 ns 8.356 ns 8.379 ns 1.12 0.00 - - - NA
GetBytes Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 36.796 ns 0.2551 ns 0.2261 ns 36.797 ns 36.321 ns 37.218 ns 1.00 0.00 0.0081 - 152 B 1.00
GetBytes Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 36.670 ns 0.2703 ns 0.2396 ns 36.661 ns 36.354 ns 37.150 ns 1.00 0.01 0.0080 - 152 B 1.00
GetChars Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 37.293 ns 0.3142 ns 0.2785 ns 37.330 ns 36.756 ns 37.780 ns 1.00 0.00 0.0148 - 280 B 1.00
GetChars Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 38.340 ns 0.2087 ns 0.1850 ns 38.320 ns 38.038 ns 38.642 ns 1.03 0.01 0.0148 - 280 B 1.00
GetByteCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 11.914 ns 0.0415 ns 0.0388 ns 11.923 ns 11.815 ns 11.957 ns 1.00 0.00 - - - NA
GetByteCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 10.412 ns 0.0483 ns 0.0452 ns 10.415 ns 10.317 ns 10.499 ns 0.87 0.01 - - - NA
GetCharCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 8.595 ns 0.0078 ns 0.0069 ns 8.597 ns 8.580 ns 8.606 ns 1.00 0.00 - - - NA
GetCharCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 11.532 ns 0.0134 ns 0.0119 ns 11.531 ns 11.514 ns 11.553 ns 1.34 0.00 - - - NA
GetBytes Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 50.754 ns 0.0815 ns 0.0637 ns 50.762 ns 50.649 ns 50.848 ns 1.00 0.00 0.0148 - 280 B 1.00
GetBytes Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 46.471 ns 0.0475 ns 0.0397 ns 46.468 ns 46.417 ns 46.560 ns 0.92 0.00 0.0149 - 280 B 1.00
GetChars Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 53.838 ns 0.4639 ns 0.4112 ns 53.651 ns 53.322 ns 54.746 ns 1.00 0.00 0.0284 - 536 B 1.00
GetChars Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 54.503 ns 1.1033 ns 0.9780 ns 54.805 ns 53.152 ns 56.528 ns 1.01 0.02 0.0283 - 536 B 1.00
GetByteCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 16.165 ns 0.1093 ns 0.1023 ns 16.188 ns 15.907 ns 16.303 ns 1.00 0.00 - - - NA
GetByteCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 14.100 ns 0.2427 ns 0.2270 ns 14.222 ns 13.641 ns 14.296 ns 0.87 0.02 - - - NA
GetCharCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 11.281 ns 0.0065 ns 0.0061 ns 11.284 ns 11.268 ns 11.287 ns 1.00 0.00 - - - NA
GetCharCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 12.991 ns 0.0526 ns 0.0492 ns 12.991 ns 12.927 ns 13.075 ns 1.15 0.00 - - - NA
GetBytes Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 84.888 ns 0.6871 ns 0.5738 ns 85.025 ns 83.542 ns 85.490 ns 1.00 0.00 0.0283 - 536 B 1.00
GetBytes Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 71.747 ns 0.9126 ns 0.8536 ns 71.638 ns 70.464 ns 73.243 ns 0.84 0.01 0.0283 - 536 B 1.00
GetChars Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 90.956 ns 1.3016 ns 1.1538 ns 90.855 ns 89.107 ns 93.114 ns 1.00 0.00 0.0556 - 1048 B 1.00
GetChars Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 88.008 ns 0.4077 ns 0.3615 ns 87.946 ns 87.422 ns 88.657 ns 0.97 0.01 0.0555 - 1048 B 1.00
GetByteCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 24.785 ns 0.1321 ns 0.1171 ns 24.794 ns 24.454 ns 24.955 ns 1.00 0.00 - - - NA
GetByteCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 15.675 ns 0.0271 ns 0.0254 ns 15.681 ns 15.625 ns 15.708 ns 0.63 0.00 - - - NA
GetCharCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 15.559 ns 0.0052 ns 0.0046 ns 15.559 ns 15.551 ns 15.567 ns 1.00 0.00 - - - NA
GetCharCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 16.235 ns 0.0329 ns 0.0308 ns 16.235 ns 16.191 ns 16.305 ns 1.04 0.00 - - - NA
GetBytes Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 174.537 ns 1.2549 ns 1.1124 ns 174.597 ns 172.235 ns 176.644 ns 1.00 0.00 0.0550 - 1048 B 1.00
GetBytes Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 129.363 ns 1.2396 ns 1.0988 ns 129.508 ns 127.232 ns 130.718 ns 0.74 0.01 0.0556 - 1048 B 1.00
GetChars Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 170.690 ns 1.3975 ns 1.2389 ns 170.792 ns 168.408 ns 172.876 ns 1.00 0.00 0.1100 0.0007 2072 B 1.00
GetChars Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 148.640 ns 0.3393 ns 0.2833 ns 148.566 ns 148.082 ns 149.138 ns 0.87 0.01 0.1100 0.0007 2072 B 1.00
GetByteCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 51.289 ns 0.2287 ns 0.2140 ns 51.320 ns 50.601 ns 51.489 ns 1.00 0.00 - - - NA
GetByteCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 34.275 ns 1.2877 ns 1.4829 ns 34.238 ns 31.380 ns 36.673 ns 0.66 0.03 - - - NA
GetCharCount Job-GLULPR \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 33.042 ns 0.6956 ns 0.7143 ns 33.531 ns 31.532 ns 33.683 ns 1.00 0.00 - - - NA
GetCharCount Job-POYNXH \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 20.591 ns 0.2576 ns 0.2409 ns 20.498 ns 20.310 ns 20.989 ns 0.62 0.01 - - - NA

With DOTNET_EnableAVX512F=0

Method Job Toolchain size encName Mean Error StdDev Median Min Max Ratio RatioSD Gen0 Gen1 Allocated Alloc Ratio
GetBytes Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 15.685 ns 0.2940 ns 0.2750 ns 15.791 ns 15.234 ns 16.202 ns 1.00 0.00 0.0021 - 40 B 1.00
GetBytes Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 15.482 ns 0.0969 ns 0.0809 ns 15.458 ns 15.378 ns 15.633 ns 0.99 0.02 0.0021 - 40 B 1.00
GetChars Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 22.264 ns 0.4973 ns 0.5528 ns 22.087 ns 21.686 ns 23.379 ns 1.00 0.00 0.0029 - 56 B 1.00
GetChars Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 23.657 ns 0.2815 ns 0.2495 ns 23.623 ns 23.312 ns 24.073 ns 1.06 0.03 0.0029 - 56 B 1.00
GetByteCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 8.275 ns 0.0317 ns 0.0247 ns 8.274 ns 8.235 ns 8.325 ns 1.00 0.00 - - - NA
GetByteCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 8.304 ns 0.0380 ns 0.0297 ns 8.306 ns 8.245 ns 8.345 ns 1.00 0.00 - - - NA
GetCharCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 5.953 ns 0.0696 ns 0.0617 ns 5.931 ns 5.891 ns 6.074 ns 1.00 0.00 - - - NA
GetCharCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 16 utf-8 7.347 ns 0.0277 ns 0.0246 ns 7.342 ns 7.302 ns 7.390 ns 1.23 0.01 - - - NA
GetBytes Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 16.420 ns 0.3640 ns 0.3226 ns 16.284 ns 16.096 ns 17.034 ns 1.00 0.00 0.0030 - 56 B 1.00
GetBytes Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 16.727 ns 0.1400 ns 0.1241 ns 16.713 ns 16.554 ns 16.938 ns 1.02 0.02 0.0029 - 56 B 1.00
GetChars Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 24.938 ns 0.4161 ns 0.3689 ns 24.880 ns 24.519 ns 25.590 ns 1.00 0.00 0.0046 - 88 B 1.00
GetChars Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 24.797 ns 0.1152 ns 0.1021 ns 24.805 ns 24.644 ns 24.927 ns 0.99 0.02 0.0047 - 88 B 1.00
GetByteCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 8.893 ns 0.0236 ns 0.0209 ns 8.894 ns 8.853 ns 8.925 ns 1.00 0.00 - - - NA
GetByteCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 7.642 ns 0.0171 ns 0.0152 ns 7.641 ns 7.618 ns 7.673 ns 0.86 0.00 - - - NA
GetCharCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 6.666 ns 0.0187 ns 0.0175 ns 6.667 ns 6.639 ns 6.703 ns 1.00 0.00 - - - NA
GetCharCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 32 utf-8 7.687 ns 0.0064 ns 0.0060 ns 7.686 ns 7.676 ns 7.698 ns 1.15 0.00 - - - NA
GetBytes Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 28.484 ns 0.2591 ns 0.2297 ns 28.462 ns 28.234 ns 28.920 ns 1.00 0.00 0.0047 - 88 B 1.00
GetBytes Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 30.190 ns 0.5999 ns 0.5009 ns 30.452 ns 29.091 ns 30.659 ns 1.06 0.02 0.0047 - 88 B 1.00
GetChars Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 29.352 ns 0.2424 ns 0.2268 ns 29.315 ns 28.931 ns 29.652 ns 1.00 0.00 0.0080 - 152 B 1.00
GetChars Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 28.814 ns 0.2307 ns 0.2045 ns 28.849 ns 28.504 ns 29.228 ns 0.98 0.01 0.0081 - 152 B 1.00
GetByteCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 8.681 ns 0.0169 ns 0.0158 ns 8.680 ns 8.653 ns 8.714 ns 1.00 0.00 - - - NA
GetByteCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 9.201 ns 0.0190 ns 0.0159 ns 9.204 ns 9.165 ns 9.221 ns 1.06 0.00 - - - NA
GetCharCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 7.376 ns 0.0084 ns 0.0070 ns 7.374 ns 7.368 ns 7.392 ns 1.00 0.00 - - - NA
GetCharCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 64 utf-8 8.089 ns 0.0232 ns 0.0217 ns 8.089 ns 8.038 ns 8.123 ns 1.10 0.00 - - - NA
GetBytes Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 36.537 ns 0.1701 ns 0.1421 ns 36.568 ns 36.284 ns 36.775 ns 1.00 0.00 0.0080 - 152 B 1.00
GetBytes Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 35.399 ns 0.1617 ns 0.1434 ns 35.347 ns 35.154 ns 35.641 ns 0.97 0.00 0.0080 - 152 B 1.00
GetChars Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 36.593 ns 0.4243 ns 0.3968 ns 36.610 ns 35.943 ns 37.308 ns 1.00 0.00 0.0148 - 280 B 1.00
GetChars Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 36.775 ns 0.3406 ns 0.3020 ns 36.863 ns 36.344 ns 37.137 ns 1.00 0.01 0.0148 - 280 B 1.00
GetByteCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 10.957 ns 0.0113 ns 0.0100 ns 10.959 ns 10.932 ns 10.972 ns 1.00 0.00 - - - NA
GetByteCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 10.908 ns 0.0235 ns 0.0220 ns 10.903 ns 10.880 ns 10.953 ns 1.00 0.00 - - - NA
GetCharCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 8.644 ns 0.0455 ns 0.0425 ns 8.645 ns 8.588 ns 8.736 ns 1.00 0.00 - - - NA
GetCharCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 128 utf-8 8.824 ns 0.1000 ns 0.0935 ns 8.822 ns 8.647 ns 8.961 ns 1.02 0.01 - - - NA
GetBytes Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 52.601 ns 0.3706 ns 0.3285 ns 52.589 ns 52.028 ns 53.098 ns 1.00 0.00 0.0148 - 280 B 1.00
GetBytes Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 49.847 ns 0.3042 ns 0.2697 ns 49.815 ns 49.474 ns 50.335 ns 0.95 0.01 0.0147 - 280 B 1.00
GetChars Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 54.972 ns 0.5678 ns 0.5033 ns 54.919 ns 54.226 ns 56.047 ns 1.00 0.00 0.0283 - 536 B 1.00
GetChars Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 54.261 ns 0.9696 ns 0.9070 ns 54.289 ns 52.871 ns 56.101 ns 0.99 0.02 0.0283 - 536 B 1.00
GetByteCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 15.922 ns 0.0803 ns 0.0751 ns 15.917 ns 15.735 ns 16.020 ns 1.00 0.00 - - - NA
GetByteCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 14.330 ns 0.0253 ns 0.0237 ns 14.343 ns 14.288 ns 14.363 ns 0.90 0.00 - - - NA
GetCharCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 11.216 ns 0.1034 ns 0.0968 ns 11.214 ns 10.986 ns 11.365 ns 1.00 0.00 - - - NA
GetCharCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 256 utf-8 10.533 ns 0.0370 ns 0.0309 ns 10.539 ns 10.476 ns 10.586 ns 0.94 0.01 - - - NA
GetBytes Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 99.951 ns 0.7868 ns 0.6974 ns 100.088 ns 98.710 ns 101.068 ns 1.00 0.00 0.0283 - 536 B 1.00
GetBytes Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 80.361 ns 0.5703 ns 0.5056 ns 80.284 ns 79.564 ns 81.469 ns 0.80 0.01 0.0281 - 536 B 1.00
GetChars Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 93.390 ns 1.0341 ns 0.9673 ns 93.122 ns 92.216 ns 95.289 ns 1.00 0.00 0.0554 - 1048 B 1.00
GetChars Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 89.364 ns 1.0377 ns 0.9199 ns 89.247 ns 87.715 ns 90.685 ns 0.96 0.01 0.0557 - 1048 B 1.00
GetByteCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 24.083 ns 0.0269 ns 0.0252 ns 24.090 ns 24.043 ns 24.120 ns 1.00 0.00 - - - NA
GetByteCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 20.248 ns 0.0847 ns 0.0792 ns 20.251 ns 19.979 ns 20.322 ns 0.84 0.00 - - - NA
GetCharCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 15.784 ns 0.0404 ns 0.0378 ns 15.798 ns 15.706 ns 15.836 ns 1.00 0.00 - - - NA
GetCharCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 512 utf-8 13.821 ns 0.0650 ns 0.0608 ns 13.831 ns 13.689 ns 13.926 ns 0.88 0.00 - - - NA
GetBytes Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 171.779 ns 1.3070 ns 1.1586 ns 171.905 ns 169.700 ns 174.232 ns 1.00 0.00 0.0551 - 1048 B 1.00
GetBytes Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 156.507 ns 0.9721 ns 0.8617 ns 156.265 ns 155.498 ns 158.305 ns 0.91 0.01 0.0555 - 1048 B 1.00
GetChars Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 167.936 ns 2.4062 ns 2.1330 ns 167.886 ns 165.697 ns 171.998 ns 1.00 0.00 0.1100 0.0007 2072 B 1.00
GetChars Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 156.658 ns 1.2998 ns 1.1523 ns 157.029 ns 154.017 ns 158.111 ns 0.93 0.01 0.1099 0.0006 2072 B 1.00
GetByteCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 49.613 ns 0.1716 ns 0.1521 ns 49.592 ns 49.279 ns 49.947 ns 1.00 0.00 - - - NA
GetByteCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 49.040 ns 1.1058 ns 1.2734 ns 49.251 ns 44.486 ns 50.348 ns 0.99 0.03 - - - NA
GetCharCount Job-ZAMPDT \runtime-base\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 26.012 ns 0.0622 ns 0.0581 ns 26.007 ns 25.938 ns 26.135 ns 1.00 0.00 - - - NA
GetCharCount Job-RVECRT \runtime\artifacts\bin\testhost\net8.0-windows-Release-x64\shared\Microsoft.NETCore.App\8.0.0\corerun.exe 1024 utf-8 19.002 ns 0.0659 ns 0.0550 ns 19.004 ns 18.890 ns 19.118 ns 0.73 0.00 - - - NA

@ghost added the community-contribution label Jul 7, 2023

@ghost commented Jul 7, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: anthonycanino
Assignees: -
Labels: area-System.Numerics, community-contribution

Milestone: -

@anthonycanino (Contributor, Author)

A rough summary of the results:

  1. We see a speedup with both the Vector512 and Vector256 implementations on large data sizes.
  2. There is a lot of room to tune, particularly since the Vector512 and Vector256 paths do not use many specialized intrinsics, though I think we want to favor this more general approach.
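To make the "more general approach" concrete, here is a minimal sketch of widening ASCII bytes to UTF-16 using only the cross-platform Vector256 helpers rather than ISA-specific intrinsics. The class and method names are hypothetical and the code is not taken from this PR; it only shows the general shape of such a path.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class WidenSketch
{
    // Hypothetical helper: copies leading ASCII bytes into chars and returns
    // how many elements were written. It stops at the first non-ASCII byte.
    public static int WidenAscii(ReadOnlySpan<byte> source, Span<char> destination)
    {
        int length = Math.Min(source.Length, destination.Length);
        ref byte src = ref MemoryMarshal.GetReference(source);
        ref ushort dst = ref MemoryMarshal.GetReference(MemoryMarshal.Cast<char, ushort>(destination));
        int i = 0;

        if (Vector256.IsHardwareAccelerated)
        {
            for (; i + Vector256<byte>.Count <= length; i += Vector256<byte>.Count)
            {
                Vector256<byte> block = Vector256.LoadUnsafe(ref src, (nuint)i);
                if (block.ExtractMostSignificantBits() != 0)
                {
                    break; // the block contains non-ASCII data; let the scalar loop find it
                }

                // Zero-extend 32 bytes into two vectors of 16 UTF-16 code units each.
                (Vector256<ushort> lower, Vector256<ushort> upper) = Vector256.Widen(block);
                lower.StoreUnsafe(ref dst, (nuint)i);
                upper.StoreUnsafe(ref dst, (nuint)(i + Vector256<ushort>.Count));
            }
        }

        // Scalar tail, which also handles the block that triggered the break.
        for (; i < length && source[i] < 0x80; i++)
        {
            destination[i] = (char)source[i];
        }

        return i;
    }
}
```

The tradeoff mentioned above applies here: a path built on specialized intrinsics may be faster, while this form stays readable and portable.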

@anthonycanino (Contributor, Author)

@dotnet/avx512-contrib can we get a review on this?

I checked the failures, which look related to a CPUID test; I don't think these changes would affect that?

@tannergooding (Member)

The CpuId failures are known, and there is a PR up to resolve them: #88623

/// Uses double instead of long to get a single instruction instead of storing temps on general purpose register (or stack)
/// </remarks>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static void StoreLowerUnsafe<T>(this Vector256<T> source, ref T destination, nuint elementOffset = 0)
Member

Why this instead of source.GetLower().StoreUnsafe(ref destination, elementOffset)?

{
uint SizeOfVector512InBytes = (uint)Vector512<byte>.Count; // JIT will make this a const

if (Unsafe.ReadUnaligned<Vector512<byte>>(pBuffer).ExtractMostSignificantBits() == 0)
Member

Why not this instead?

Suggested change
- if (Unsafe.ReadUnaligned<Vector512<byte>>(pBuffer).ExtractMostSignificantBits() == 0)
+ if (Vector512.Load(pBuffer).ExtractMostSignificantBits() == 0)


if (Vector512.IsHardwareAccelerated && bufferLength >= 2 * (uint)Vector512<byte>.Count)
{
uint SizeOfVector512InBytes = (uint)Vector512<byte>.Count; // JIT will make this a const
Member

We actually have an internal Vector512.Size that is a const for just these types of cases.

uint SizeOfVectorInChars = (uint)Vector<ushort>.Count; // JIT will make this a const
uint SizeOfVectorInBytes = (uint)Vector<byte>.Count; // JIT will make this a const
uint SizeOfVector512InChars = (uint)Vector512<ushort>.Count; // JIT will make this a const
uint SizeOfVector512InBytes = (uint)Vector512.Size; // JIT will make this a const
Member

You should be able to use Vector512.Size directly and have C# keep it a const instead.

It will make things a bit more readable and allow C# to constant fold some of the simple cases (e.g. ~(nuint)(SizeOfVector512InBytes - 1) itself will become constant foldable, rather than forcing IL to emit ldloc; sub; conv.i; sub and then requiring the JIT to optimize that down).
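The difference between a local and a const can be sketched as below. Vector512.Size is internal to the runtime, so this example substitutes a hypothetical constant, Vector512SizeInBytes, set to 64; the class and method names are also made up for illustration.

```csharp
using System;

public static class AlignmentSketch
{
    // Stand-in (assumption) for the internal Vector512.Size constant: 64 bytes.
    private const int Vector512SizeInBytes = 64;

    // Rounds a byte count down to a multiple of the vector size. Because the size
    // is a const, "Vector512SizeInBytes - 1" is a compile-time constant expression
    // instead of a subtraction on a local that the JIT has to fold later.
    public static nuint RoundDownToVectorSize(nuint byteCount)
    {
        return byteCount & ~(nuint)(Vector512SizeInBytes - 1);
    }
}
```

With a `uint` local in place of the const, the same expression compiles, but the arithmetic is emitted in IL and left for the JIT to simplify, which is the cost the comment above describes.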

/// Uses double instead of long to get a single instruction instead of storing temps on general purpose register (or stack)
/// </remarks>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static void StoreLowerUnsafe<T>(this Vector512<T> source, ref T destination, nuint elementOffset = 0)
Member

Same general question here as on the other. Why this helper rather than simply doing source.GetLower().StoreUnsafe(ref destination, elementOffset)?

It looks like the right stuff happens for the former already and it avoids the JIT needing to do any inlining for it.


if (Vector512.IsHardwareAccelerated && bufferLength >= 2 * (uint)Vector512<byte>.Count)
{
uint SizeOfVector512InBytes = (uint)Vector512.Size; // JIT will make this a const
Member

Same here and other places. If we use Vector512.Size directly, we get much simpler, constant-foldable IL, allowing the JIT to do less work.

Since Vector512.Size is itself a constant, we shouldn't need to insert extra casts anywhere. But if we did and it was a non-trivial number, it would be better to declare this as const uint SizeOfVector512InBytes instead so the same constant folding could still happen.

@@ -1407,6 +1867,20 @@ private static bool VectorContainsNonAsciiChar(Vector128<byte> asciiVector)
}
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool VectorContainsNonAsciiChar(Vector256<byte> asciiVector)
Member

Do we need wrapper methods for this simple case?

Is there any reason it can't just be asciiVector != Vector256<byte>.Zero instead which allows ptest; jcc rather than movmsk, cmp; jcc?

Contributor Author

I was doing this for code reuse/readability.

Would you prefer to just inline the function?

Member

I think inlining it manually in this case would be better.

The abstractions like this are generally helpful when we have more complex logic differing between platforms or when there is a high likelihood to need to identify and change the pattern for the same pattern repeatedly in the future.

In cases like this, where we're really just doing a trivial comparison check, I don't think the helper buys us much in terms of readability/maintainability and it does have drawbacks in the form of forcing the JIT to do more work to inline/optimize the code.

// Narrows two vectors of words [ w7 w6 w5 w4 w3 w2 w1 w0 ] and [ w7' w6' w5' w4' w3' w2' w1' w0' ]
// to a vector of bytes [ b7 ... b0 b7' ... b0'].

// prefer architecture specific intrinsic as they don't perform additional AND like Vector512.Narrow does
Member

I think this comment is out of sync. We're using the non architecture specific Narrow below.

Likewise, given we're just calling the xplat narrow, do we need this helper method or can we just call Narrow directly and avoid the need to inline?

Comment on lines 2192 to 2193
uint SizeOfVector256 = (uint)Vector256<byte>.Count;
nuint MaskOfAllBitsInVector256 = (nuint)(SizeOfVector256 - 1);
Member

These can both be const, using Vector256.Size directly and declaring const nuint MaskOfAllBitsInVector256 for the latter.

Member

Same for the equivalent case in Intrinsified_512

1. turn some variables into explicitly specified const.
2. removed some helper functions and inlined them.
Debug.Assert((nuint)pBuffer % Vector512.Size == 0, "Vector read should be aligned.");
if (Vector512.LoadAligned(pBuffer).ExtractMostSignificantBits() != 0)
{
break; // found non-ASCII data
Member

If we're within the vectorized code path and see non-ASCII data, since we already have the return value of ExtractMostSignificantBits in a register somewhere, I wonder if it would make sense to tzcnt the result and return immediately rather than falling down the drain code path.

Not important for this review, just a random musing.
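
As a rough illustration of that musing (a sketch only, not the code under review): the bitmask from ExtractMostSignificantBits has bit n set when byte n has its high bit set, so the trailing zero count of a non-zero mask is the position of the first non-ASCII byte within the block. The class and method names below are hypothetical, and the sketch uses Vector128 and spans rather than the aligned Vector512 pointer loop in the PR.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

public static class FirstNonAsciiSketch
{
    // Hypothetical helper for illustration; not the method under review.
    // Returns the index of the first byte >= 0x80, or -1 if the data is all ASCII.
    public static int IndexOfFirstNonAsciiByte(ReadOnlySpan<byte> data)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated)
        {
            for (; i + Vector128<byte>.Count <= data.Length; i += Vector128<byte>.Count)
            {
                // Bit n of the mask is the high bit of byte n in this block.
                uint mask = Vector128.Create(data.Slice(i, Vector128<byte>.Count))
                                     .ExtractMostSignificantBits();
                if (mask != 0)
                {
                    // The lowest set bit marks the first non-ASCII byte, so return
                    // right away instead of falling through to the scalar loop.
                    return i + BitOperations.TrailingZeroCount(mask);
                }
            }
        }

        // Scalar drain loop for the tail (or when vectors are not accelerated).
        for (; i < data.Length; i++)
        {
            if (data[i] >= 0x80)
            {
                return i;
            }
        }

        return -1;
    }
}
```

Whether the early return actually beats the existing drain path would need to be measured; the sketch only shows that the index is recoverable from the mask already in hand.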


do
{
Debug.Assert((nuint)pBuffer % SizeOfVector512InChars == 0, "Vector read should be aligned.");
Member

This should read:

Debug.Assert((nuint)pBuffer % SizeOfVector512InBytes == 0, "Vector read should be aligned.");

(Looks like this is a bug in the original GetIndexOfFirstNonAsciiChar_Default method as well.)
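
To make the difference concrete, here is a small sketch with illustrative constants and a hypothetical IsAligned helper (neither is taken from the PR). A pointer address is measured in bytes, so the modulus in the assert has to be a byte count; using the char count checks for only half the required alignment.

```csharp
using System;

public static class AlignmentAssertSketch
{
    // Illustrative constants; 64 matches the byte width of a 512-bit vector.
    public const uint SizeOfVector512InBytes = 64;
    public const uint SizeOfVector512InChars = SizeOfVector512InBytes / sizeof(char); // 32

    // Hypothetical helper: true when the address is a multiple of the alignment.
    public static bool IsAligned(ulong address, uint alignmentInBytes)
        => address % alignmentInBytes == 0;
}
```

For example, the address 0x1020 satisfies IsAligned(0x1020, SizeOfVector512InChars) but not IsAligned(0x1020, SizeOfVector512InBytes), so an assert written against the char count would pass on a pointer that is only 32-byte aligned.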

{
const uint SizeOfVector512InChars = Vector512.Size / sizeof(ushort);

Vector512<ushort> asciiMask = Vector512.Create((ushort) 0xFF80);
Member

Unused local?

// We're going to get the best performance when we have aligned writes, so we'll take the
// hit of potentially unaligned reads in order to hit this sweet spot.

// pAsciiBuffer points to the start of the destination buffer, immediately before where we wrote
Member

Para should be updated: 0x10 bit, &pAsciiBuffer[SizeOfVector256 / 2], 16-byte write.

Looking back over the original comment this was copied from, I realize now what I originally wrote was word salad. 🙂 My comment wasn't intended to refer to the value stored at the referenced address, but rather the address itself. Basically, if you're trying to become 16-byte aligned, then one of the following must hold: (a) from your current position, you can back up 7 or fewer bytes to achieve 16-byte alignment; or (b) you can write 8 bytes, bump the pointer, then back up 7 or fewer bytes to achieve 16-byte alignment.

For this method, since you're trying to become 32-byte aligned, those clauses become "15 or fewer bytes" and "you can write 16 bytes."
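
The address arithmetic behind clauses (a) and (b) can be sketched as follows. This is an illustration under assumptions, not the PR's code: the Plan helper, its tuple result, and the use of plain ulong addresses are all hypothetical, and the vector size is assumed to be a power of two.

```csharp
using System;

public static class AlignedWriteSketch
{
    // Hypothetical helper: given the destination start address and a power-of-two
    // vector size in bytes, returns how many bytes the half-vector prologue writes
    // and the offset at which aligned full-vector writes can begin.
    public static (ulong BytesWritten, ulong FirstAlignedOffset) Plan(ulong start, uint vectorSize)
    {
        ulong half = vectorSize / 2;

        // One half-vector write always happens. If the "half" bit of the address is
        // clear, a second half-vector write is needed to get past the next boundary.
        ulong bytesWritten = (start & half) == 0 ? vectorSize : half;

        // Distance from start to the next aligned boundary, in the range [1, vectorSize].
        ulong firstAlignedOffset = vectorSize - (start & (vectorSize - 1));

        return (bytesWritten, firstAlignedOffset);
    }
}
```

With vectorSize = 16, the amount backed up (BytesWritten minus FirstAlignedOffset) is always 7 or fewer bytes; with vectorSize = 32 it is 15 or fewer, and the tested bit becomes 0x10. For instance, Plan(0x1003, 16) gives (16, 13), a back-up of 3 bytes.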

goto Finish;
}

// Turn the 32 ASCII chars we just read into 32 ASCII bytes, then copy it to the destination.
Member

16, not 32.


// First part was all ASCII, narrow and aligned write. Note we're only filling in the low half of the vector.

Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % sizeof(ulong) == 0, "Destination should be ulong-aligned.");
Member

Suggested change
- Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % sizeof(ulong) == 0, "Destination should be ulong-aligned.");
+ Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % Vector128.Size == 0, "Destination should be 128-bit-aligned.");

// We're going to get the best performance when we have aligned writes, so we'll take the
// hit of potentially unaligned reads in order to hit this sweet spot.

// pAsciiBuffer points to the start of the destination buffer, immediately before where we wrote
Member

Same feedback as earlier: fix up comments.


// First part was all ASCII, narrow and aligned write. Note we're only filling in the low half of the vector.

Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % sizeof(ulong) == 0, "Destination should be ulong-aligned.");
Member

Suggested change
- Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % sizeof(ulong) == 0, "Destination should be ulong-aligned.");
+ Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % Vector256.Size == 0, "Destination should be 256-bit-aligned.");

break;
}

(Vector512<ushort> low, Vector512<ushort> upper) = Vector512.Widen(asciiVector);
Member

Nit: keep the same local variable names as the other blocks in this method.

@tannergooding
Member

CC. @GrabYourPitchforks, looks like all your feedback has been addressed.

Comment on lines +104 to +108
if (Vector512.IsHardwareAccelerated || Vector256.IsHardwareAccelerated)
{
return GetIndexOfFirstNonAsciiByte_Vector(pBuffer, bufferLength);
}
else if (Sse2.IsSupported || (AdvSimd.IsSupported && BitConverter.IsLittleEndian))
Member

GetIndexOfFirstNonAsciiByte_Vector has a Vector128.IsHardwareAccelerated code path. We can't just rely on that and delete GetIndexOfFirstNonAsciiByte_Intrinsified?

Contributor Author

The Vector128 fallback path is slower than the (more complex) Intrinsified path (see #88532 (comment)).

@tannergooding tannergooding merged commit a513676 into dotnet:main Jul 17, 2023
163 of 166 checks passed
@dotnet dotnet locked as resolved and limited conversation to collaborators Aug 17, 2023
Labels
arch-avx512 Related to the AVX-512 architecture area-System.Numerics community-contribution Indicates that the PR has been added by a community member

5 participants