try to port ASCIIUtility.WidenAsciiToUtf16 to x-plat intrinsics #73055

Merged · 16 commits · Sep 9, 2022

Conversation

adamsitnik (Member) commented Jul 29, 2022:

EDIT: for updated perf numbers please go to #73055 (comment)

x64

Initially there was a major regression, but I was able to solve it by enforcing the inlining of Vector128.Widen(Vector128<byte>). After porting everything, the new implementation was on par (a simplified sketch of the ported loop follows the table):

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  Job-IVMYPN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-BJNCCX : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
| Type | Method | Job | size | encName | Input | Mean | Ratio |
|------|--------|-----|------|---------|-------|------|-------|
| Perf_Encoding | GetString | PR | 16 | ascii | ? | 22.09 ns | 0.97 |
| Perf_Encoding | GetString | base | 16 | ascii | ? | 22.72 ns | 1.00 |
| Perf_Encoding | GetString | PR | 16 | utf-8 | ? | 20.37 ns | 0.99 |
| Perf_Encoding | GetString | base | 16 | utf-8 | ? | 20.55 ns | 1.00 |
| Perf_Encoding | GetString | PR | 512 | ascii | ? | 70.05 ns | 0.93 |
| Perf_Encoding | GetString | base | 512 | ascii | ? | 75.12 ns | 1.00 |
| Perf_Encoding | GetString | PR | 512 | utf-8 | ? | 77.35 ns | 1.00 |
| Perf_Encoding | GetString | base | 512 | utf-8 | ? | 78.87 ns | 1.00 |
| Perf_Utf8Encoding | GetString | PR | ? | ? | EnglishAllAscii | 20,990.29 ns | 0.99 |
| Perf_Utf8Encoding | GetString | base | ? | ? | EnglishAllAscii | 21,203.47 ns | 1.00 |
| Perf_Utf8Encoding | GetString | PR | ? | ? | EnglishMostlyAscii | 125,617.25 ns | 0.99 |
| Perf_Utf8Encoding | GetString | base | ? | ? | EnglishMostlyAscii | 126,595.44 ns | 1.00 |
| Perf_Utf8Encoding | GetString | PR | ? | ? | Chinese | 156,988.65 ns | 1.00 |
| Perf_Utf8Encoding | GetString | base | ? | ? | Chinese | 156,257.20 ns | 1.00 |
| Perf_Utf8Encoding | GetString | PR | ? | ? | Cyrillic | 155,448.25 ns | 1.00 |
| Perf_Utf8Encoding | GetString | base | ? | ? | Cyrillic | 155,961.57 ns | 1.00 |
| Perf_Utf8Encoding | GetString | PR | ? | ? | Greek | 244,318.02 ns | 1.00 |
| Perf_Utf8Encoding | GetString | base | ? | ? | Greek | 244,570.33 ns | 1.00 |
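For readers who want the shape of the change, here is a minimal, hypothetical sketch of the ported Vector128 loop using the cross-platform intrinsics (class, method, and variable names are illustrative, not the PR's exact code):

```csharp
using System.Runtime.Intrinsics;

internal static unsafe class WidenSketch
{
    // Widen full Vector128 blocks of ASCII bytes into UTF-16 until a non-ASCII byte is seen.
    // Caller guarantees elementCount >= Vector128<byte>.Count.
    internal static nuint WidenAsciiToUtf16_Sketch(byte* pAsciiBuffer, char* pUtf16Buffer, nuint elementCount)
    {
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = elementCount - (nuint)Vector128<byte>.Count;

        while (currentOffset <= finalOffsetWhereCanRunLoop)
        {
            Vector128<byte> asciiVector = Vector128.Load(pAsciiBuffer + currentOffset);

            // Any byte with the high bit set is non-ASCII; bail out to scalar handling.
            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                break;
            }

            // Zero-extend 16 ASCII bytes into 16 UTF-16 code units and store both halves.
            (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(asciiVector);
            lower.Store((ushort*)pUtf16Buffer + currentOffset);
            upper.Store((ushort*)pUtf16Buffer + currentOffset + (nuint)Vector128<ushort>.Count);

            currentOffset += (nuint)Vector128<byte>.Count;
        }

        return currentOffset; // number of elements widened so far
    }
}
```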

With some additional optimizations and the addition of a Vector256 code path, it's now on par or up to 20% faster, depending on the input (the more ASCII characters, the better).

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
  [Host]     : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
  
Method=GetString
| Type | Job | size | encName | Input | Mean | Ratio |
|------|-----|------|---------|-------|------|-------|
| Perf_Encoding | PR | 16 | ascii | ? | 22.22 ns | 0.99 |
| Perf_Encoding | main | 16 | ascii | ? | 22.56 ns | 1.00 |
| Perf_Encoding | PR | 16 | utf-8 | ? | 20.19 ns | 0.94 |
| Perf_Encoding | main | 16 | utf-8 | ? | 21.42 ns | 1.00 |
| Perf_Encoding | PR | 512 | ascii | ? | 57.58 ns | 0.77 |
| Perf_Encoding | main | 512 | ascii | ? | 74.38 ns | 1.00 |
| Perf_Encoding | PR | 512 | utf-8 | ? | 67.43 ns | 0.79 |
| Perf_Encoding | main | 512 | utf-8 | ? | 84.91 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | EnglishAllAscii | 20,610.46 ns | 0.98 |
| Perf_Utf8Encoding | main | ? | ? | EnglishAllAscii | 21,046.16 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | EnglishMostlyAscii | 129,464.76 ns | 1.00 |
| Perf_Utf8Encoding | main | ? | ? | EnglishMostlyAscii | 128,755.26 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | Chinese | 157,597.33 ns | 1.03 |
| Perf_Utf8Encoding | main | ? | ? | Chinese | 153,599.10 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | Cyrillic | 151,221.52 ns | 0.98 |
| Perf_Utf8Encoding | main | ? | ? | Cyrillic | 153,741.42 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | Greek | 245,325.88 ns | 1.01 |
| Perf_Utf8Encoding | main | ? | ? | Greek | 243,378.47 ns | 1.00 |

ARM64

Initially, after just mapping the code over, there was a major regression of 40-50%:

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22378.8
  [Host]     : .NET 7.0.0 (7.0.22.37802), Arm64 RyuJIT AdvSIMD
  Job-VUTCOY : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-LKXRPH : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Type | Method | Toolchain | size | encName | Input | Mean | Ratio |
|------|--------|-----------|------|---------|-------|------|-------|
| Perf_Encoding | GetString | /7.0.0/corerun | 16 | ascii | ? | 90.41 ns | 1.02 |
| Perf_Encoding | GetString | /main/corerun | 16 | ascii | ? | 88.37 ns | 1.00 |
| Perf_Encoding | GetString | /7.0.0/corerun | 16 | utf-8 | ? | 81.14 ns | 1.00 |
| Perf_Encoding | GetString | /main/corerun | 16 | utf-8 | ? | 81.43 ns | 1.00 |
| Perf_Encoding | GetString | /7.0.0/corerun | 512 | ascii | ? | 376.65 ns | 1.55 |
| Perf_Encoding | GetString | /main/corerun | 512 | ascii | ? | 242.93 ns | 1.00 |
| Perf_Encoding | GetString | /7.0.0/corerun | 512 | utf-8 | ? | 435.39 ns | 1.43 |
| Perf_Encoding | GetString | /main/corerun | 512 | utf-8 | ? | 305.42 ns | 1.00 |
| Perf_Utf8Encoding | GetString | /7.0.0/corerun | ? | ? | EnglishAllAscii | 142,873.52 ns | 1.39 |
| Perf_Utf8Encoding | GetString | /main/corerun | ? | ? | EnglishAllAscii | 102,949.86 ns | 1.00 |
| Perf_Utf8Encoding | GetString | /7.0.0/corerun | ? | ? | EnglishMostlyAscii | 267,682.01 ns | 1.04 |
| Perf_Utf8Encoding | GetString | /main/corerun | ? | ? | EnglishMostlyAscii | 256,662.77 ns | 1.00 |
| Perf_Utf8Encoding | GetString | /7.0.0/corerun | ? | ? | Chinese | 372,408.90 ns | 0.99 |
| Perf_Utf8Encoding | GetString | /main/corerun | ? | ? | Chinese | 376,872.54 ns | 1.00 |
| Perf_Utf8Encoding | GetString | /7.0.0/corerun | ? | ? | Cyrillic | 277,097.47 ns | 1.00 |
| Perf_Utf8Encoding | GetString | /main/corerun | ? | ? | Cyrillic | 275,724.07 ns | 1.00 |
| Perf_Utf8Encoding | GetString | /7.0.0/corerun | ? | ? | Greek | 418,494.63 ns | 1.01 |
| Perf_Utf8Encoding | GetString | /main/corerun | ? | ? | Greek | 416,134.26 ns | 1.00 |

After I replaced Vector128.Widen with Vector128.WidenLower and Vector128.WidenUpper, I was able to reduce the regression to 10-16% (see the sketch after the table below):

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22379.1
  [Host]     : .NET 7.0.0 (7.0.22.37802), Arm64 RyuJIT AdvSIMD
  
  Method=GetString
| Type | Job | size | encName | Input | Mean | Ratio |
|------|-----|------|---------|-------|------|-------|
| Perf_Encoding | PR | 16 | ascii | ? | 91.36 ns | 1.02 |
| Perf_Encoding | main | 16 | ascii | ? | 89.58 ns | 1.00 |
| Perf_Encoding | PR | 16 | utf-8 | ? | 81.10 ns | 0.97 |
| Perf_Encoding | main | 16 | utf-8 | ? | 83.81 ns | 1.00 |
| Perf_Encoding | PR | 512 | ascii | ? | 282.59 ns | 1.16 |
| Perf_Encoding | main | 512 | ascii | ? | 244.48 ns | 1.00 |
| Perf_Encoding | PR | 512 | utf-8 | ? | 337.99 ns | 1.10 |
| Perf_Encoding | main | 512 | utf-8 | ? | 308.56 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | EnglishAllAscii | 114,550.43 ns | 1.13 |
| Perf_Utf8Encoding | main | ? | ? | EnglishAllAscii | 101,123.74 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | EnglishMostlyAscii | 270,465.07 ns | 1.06 |
| Perf_Utf8Encoding | main | ? | ? | EnglishMostlyAscii | 254,863.02 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | Chinese | 370,954.69 ns | 0.97 |
| Perf_Utf8Encoding | main | ? | ? | Chinese | 382,811.25 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | Cyrillic | 278,273.22 ns | 1.01 |
| Perf_Utf8Encoding | main | ? | ? | Cyrillic | 275,921.61 ns | 1.00 |
| Perf_Utf8Encoding | PR | ? | ? | Greek | 418,066.91 ns | 1.01 |
| Perf_Utf8Encoding | main | ? | ? | Greek | 415,772.69 ns | 1.00 |
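For reference, a hedged sketch of that replacement (hypothetical helper name, not the exact PR diff):

```csharp
using System.Runtime.Intrinsics;

internal static unsafe class WidenHalvesSketch
{
    // Instead of `(lower, upper) = Vector128.Widen(asciiVector)`, widen each half explicitly;
    // on Arm64 each call should map onto a single AdvSIMD widening instruction, whereas the
    // tuple-returning Vector128.Widen was going through a slower path at the time.
    internal static void WidenOneVector(Vector128<byte> asciiVector, ushort* pCurrentWriteAddress)
    {
        Vector128.WidenLower(asciiVector).Store(pCurrentWriteAddress);
        Vector128.WidenUpper(asciiVector).Store(pCurrentWriteAddress + Vector128<ushort>.Count);
    }
}
```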

I am trying to update my Surface Pro X to Win 11, which will allow me to install VS 2022 (ARM64), build dotnet/runtime, and try the new internal MS profiler, but I can't promise anything.

contributes to #64451

cc @tannergooding @GrabYourPitchforks

ghost commented Jul 29, 2022:

Tagging subscribers to this area: @dotnet/area-system-text-encoding
See info in area-owners.md if you want to be subscribed.
@@ -3618,6 +3618,7 @@ public static bool TryCopyTo<T>(this Vector128<T> vector, Span<T> destination)
/// <param name="source">The vector whose elements are to be widened.</param>
/// <returns>A pair of vectors that contain the widened lower and upper halves of <paramref name="source" />.</returns>
[CLSCompliant(false)]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
Member:

We should add this to all the Widen APIs, and to the corresponding ones on Vector64.

Member:

Can the JIT use the named-intrinsic mechanism and treat all methods on Vector###<T> as candidates for aggressive inlining?

adamsitnik (Member, Author):

@tannergooding I've synced my fork with upstream and verified that it's not needed anymore, so we don't need to backport anything to 7.0.

adamsitnik (Member, Author) commented:

I added Vector256 code path and got it up to 20% faster on x64.

I replaced Vector128.Widen with Vector128.WidenLower and Vector128.WidenUpper and got the ARM64 regression down to 10-16% (from 40-50%).

I updated the results posted above.


// Can we at least widen the first part of the vector?

if (!containsNonAsciiBytes)
adamsitnik (Member, Author):

I've removed this code: the condition could never be false, because the jump below was taken whenever the flag was set to true:

if (containsNonAsciiBytes)
{
    // non-ASCII byte somewhere
    goto NonAsciiDataSeenInInnerLoop;
}


// Calculate how many elements we wrote in order to get pOutputBuffer to its next alignment
adamsitnik (Member, Author):

Based on the benchmarking I've done, this was not improving perf in any noticeable way but was increasing code complexity. I've removed it and added the Vector256 code path instead (less code = less code to duplicate).

adamsitnik (Member, Author) commented Aug 19, 2022:

Edit: it turned out to be an R2R bug: #74253

OK, I have no idea what is going on and I need an extra pair of eyes.

The tests are failing with "Encountered infinite recursion while looking up resource 'Word_At' in System.Private.CoreLib.":

The callstack is quite long, but it shows that the failure starts in the WidenAsciiToUtf16_Vector256 method, which I've just added:

   at System.Diagnostics.Debug.Fail(System.String, System.String)
   at System.Text.ASCIIUtility.WidenAsciiToUtf16_Vector256(Byte*, Char*, UIntPtr)
   at System.Text.ASCIIUtility.WidenAsciiToUtf16(Byte*, Char*, UIntPtr)

This method has 3 debug asserts:

private static unsafe nuint WidenAsciiToUtf16_Vector256(byte* pAsciiBuffer, char* pUtf16Buffer, nuint elementCount)
{
    Debug.Assert(Vector256.IsHardwareAccelerated);
    Debug.Assert(BitConverter.IsLittleEndian);
    Debug.Assert(elementCount >= 2 * (uint)Vector256<byte>.Count);

And the only place where it's called from has exactly the same guards:

if (BitConverter.IsLittleEndian && Vector256.IsHardwareAccelerated && elementCount >= 2 * (uint)Vector256<byte>.Count)
{
    currentOffset = WidenAsciiToUtf16_Vector256(pAsciiBuffer, pUtf16Buffer, elementCount);
}

So the asserts mentioned above should definitely not fail.

I was able to reproduce the failure locally. It's gone when I remove those 3 asserts! What am I missing?

danmoseley (Member):

HTTP test crash. Worth cracking open the dump to check it's not related to this change?

davidwrighton (Member):

@adamsitnik please don't just comment that assert out. The generated code will not reliably behave correctly; you need to follow the instructions I put in #74253.

adamsitnik (Member, Author):

> @adamsitnik please don't just comment that assert out. The generated code will not reliably behave correctly; you need to follow the instructions I put in #74253.

@davidwrighton thank you!

BTW, based on the perf results, I decided to simply inline these two helpers:

| Method | Toolchain | size | encName | Mean | Ratio |
|--------|-----------|------|---------|------|-------|
| GetString | \helpers\corerun.exe | 8 | ascii | 18.53 ns | 0.88 |
| GetString | \inlined\corerun.exe | 8 | ascii | 17.47 ns | 0.83 |
| GetString | \main\corerun.exe | 8 | ascii | 21.16 ns | 1.00 |
| GetString | \helpers\corerun.exe | 8 | utf-8 | 16.84 ns | 0.80 |
| GetString | \inlined\corerun.exe | 8 | utf-8 | 15.96 ns | 0.76 |
| GetString | \main\corerun.exe | 8 | utf-8 | 20.96 ns | 1.00 |
| GetString | \helpers\corerun.exe | 16 | ascii | 24.87 ns | 0.83 |
| GetString | \inlined\corerun.exe | 16 | ascii | 18.32 ns | 0.61 |
| GetString | \main\corerun.exe | 16 | ascii | 30.11 ns | 1.00 |
| GetString | \helpers\corerun.exe | 16 | utf-8 | 18.20 ns | 0.77 |
| GetString | \inlined\corerun.exe | 16 | utf-8 | 16.27 ns | 0.69 |
| GetString | \main\corerun.exe | 16 | utf-8 | 23.71 ns | 1.00 |
| GetString | \helpers\corerun.exe | 32 | ascii | 18.50 ns | 0.61 |
| GetString | \inlined\corerun.exe | 32 | ascii | 17.07 ns | 0.56 |
| GetString | \main\corerun.exe | 32 | ascii | 30.56 ns | 1.00 |
| GetString | \helpers\corerun.exe | 32 | utf-8 | 20.95 ns | 0.84 |
| GetString | \inlined\corerun.exe | 32 | utf-8 | 18.00 ns | 0.72 |
| GetString | \main\corerun.exe | 32 | utf-8 | 24.88 ns | 1.00 |
| GetString | \helpers\corerun.exe | 64 | ascii | 21.41 ns | 0.60 |
| GetString | \inlined\corerun.exe | 64 | ascii | 20.59 ns | 0.58 |
| GetString | \main\corerun.exe | 64 | ascii | 35.69 ns | 1.00 |
| GetString | \helpers\corerun.exe | 64 | utf-8 | 30.83 ns | 1.03 |
| GetString | \inlined\corerun.exe | 64 | utf-8 | 28.59 ns | 0.96 |
| GetString | \main\corerun.exe | 64 | utf-8 | 29.89 ns | 1.00 |
| GetString | \helpers\corerun.exe | 512 | ascii | 55.39 ns | 0.72 |
| GetString | \inlined\corerun.exe | 512 | ascii | 53.79 ns | 0.70 |
| GetString | \main\corerun.exe | 512 | ascii | 77.35 ns | 1.00 |
| GetString | \helpers\corerun.exe | 512 | utf-8 | 71.63 ns | 0.89 |
| GetString | \inlined\corerun.exe | 512 | utf-8 | 69.81 ns | 0.87 |
| GetString | \main\corerun.exe | 512 | utf-8 | 80.38 ns | 1.00 |

adamsitnik (Member, Author) commented:

Updated perf numbers:

For size==16 we can observe a gain on both x64 and arm64. It's caused by the vectorized code path now being executed (previously the buffer needed to be at least double Vector128<byte>.Count, i.e. 32 bytes); see the sketch below.
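A hedged illustration of that threshold change (the PR's exact guards may differ):

```csharp
using System.Runtime.Intrinsics;

internal static class ThresholdSketch
{
    // Before: the vectorized path required room for two full vectors (32 bytes).
    internal static bool TookVectorizedPathBefore(nuint elementCount)
        => elementCount >= 2 * (uint)Vector128<byte>.Count;

    // After: one full vector (16 bytes) is enough, so size==16 inputs now vectorize too.
    internal static bool TakesVectorizedPathNow(nuint elementCount)
        => elementCount >= (uint)Vector128<byte>.Count;
}
```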

x64 AVX2

There is a 10-30% boost for large inputs. It's caused by adding the Vector256 code path.

Small inputs also work faster, partially because WidenFourAsciiBytesToUtf16AndWriteToBuffer is producing better code gen now. I am afraid that some of these gains are caused by code alignment changes (the benchmarks themselves were run with memory randomization enabled, so we can exclude memory alignment from the list).

BenchmarkDotNet=v0.13.1.20220823-develop, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
  Job-SJSFRM : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-QAHUTQ : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

LaunchCount=9 MemoryRandomization=True
| Method | Toolchain | size | encName | Mean | Ratio |
|--------|-----------|------|---------|------|-------|
| GetString | \main\corerun.exe | 8 | ascii | 24.70 ns | 1.00 |
| GetString | \pr\corerun.exe | 8 | ascii | 16.86 ns | 0.69 |
| GetString | \main\corerun.exe | 8 | utf-8 | 19.65 ns | 1.00 |
| GetString | \pr\corerun.exe | 8 | utf-8 | 16.48 ns | 0.84 |
| GetString | \main\corerun.exe | 16 | ascii | 27.84 ns | 1.00 |
| GetString | \pr\corerun.exe | 16 | ascii | 18.36 ns | 0.66 |
| GetString | \main\corerun.exe | 16 | utf-8 | 21.78 ns | 1.00 |
| GetString | \pr\corerun.exe | 16 | utf-8 | 17.26 ns | 0.79 |
| GetString | \main\corerun.exe | 32 | ascii | 30.05 ns | 1.00 |
| GetString | \pr\corerun.exe | 32 | ascii | 17.09 ns | 0.57 |
| GetString | \main\corerun.exe | 32 | utf-8 | 23.42 ns | 1.00 |
| GetString | \pr\corerun.exe | 32 | utf-8 | 17.59 ns | 0.75 |
| GetString | \main\corerun.exe | 64 | ascii | 33.73 ns | 1.00 |
| GetString | \pr\corerun.exe | 64 | ascii | 20.59 ns | 0.61 |
| GetString | \main\corerun.exe | 64 | utf-8 | 26.96 ns | 1.00 |
| GetString | \pr\corerun.exe | 64 | utf-8 | 27.51 ns | 1.02 |
| GetString | \main\corerun.exe | 512 | ascii | 75.27 ns | 1.00 |
| GetString | \pr\corerun.exe | 512 | ascii | 52.61 ns | 0.70 |
| GetString | \main\corerun.exe | 512 | utf-8 | 76.35 ns | 1.00 |
| GetString | \pr\corerun.exe | 512 | utf-8 | 67.73 ns | 0.89 |

ARM64 AdvSIMD

There is a small (4-8%) gain for all test cases.

BenchmarkDotNet=v0.13.1.1845-nightly, OS=Windows 11 (10.0.22622.575)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-rc.2.22422.7
  [Host]     : .NET 7.0.0 (7.0.22.41112), Arm64 RyuJIT AdvSIMD
  Job-AHYKGY : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-THHHUV : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

LaunchCount=9 MemoryRandomization=True
| Method | Toolchain | size | encName | Mean | Ratio |
|--------|-----------|------|---------|------|-------|
| GetString | \main\corerun.exe | 8 | ascii | 29.61 ns | 1.00 |
| GetString | \pr\corerun.exe | 8 | ascii | 28.39 ns | 0.96 |
| GetString | \main\corerun.exe | 8 | utf-8 | 29.42 ns | 1.00 |
| GetString | \pr\corerun.exe | 8 | utf-8 | 28.39 ns | 0.96 |
| GetString | \main\corerun.exe | 16 | ascii | 33.19 ns | 1.00 |
| GetString | \pr\corerun.exe | 16 | ascii | 30.45 ns | 0.92 |
| GetString | \main\corerun.exe | 16 | utf-8 | 33.15 ns | 1.00 |
| GetString | \pr\corerun.exe | 16 | utf-8 | 29.42 ns | 0.89 |
| GetString | \main\corerun.exe | 32 | ascii | 39.20 ns | 1.00 |
| GetString | \pr\corerun.exe | 32 | ascii | 34.30 ns | 0.88 |
| GetString | \main\corerun.exe | 32 | utf-8 | 37.09 ns | 1.00 |
| GetString | \pr\corerun.exe | 32 | utf-8 | 33.67 ns | 0.91 |
| GetString | \main\corerun.exe | 64 | ascii | 47.49 ns | 1.00 |
| GetString | \pr\corerun.exe | 64 | ascii | 41.39 ns | 0.87 |
| GetString | \main\corerun.exe | 64 | utf-8 | 50.62 ns | 1.00 |
| GetString | \pr\corerun.exe | 64 | utf-8 | 44.74 ns | 0.89 |
| GetString | \main\corerun.exe | 512 | ascii | 162.25 ns | 1.00 |
| GetString | \pr\corerun.exe | 512 | ascii | 148.39 ns | 0.92 |
| GetString | \main\corerun.exe | 512 | utf-8 | 199.57 ns | 1.00 |
| GetString | \pr\corerun.exe | 512 | utf-8 | 182.53 ns | 0.92 |

{
Vector256<byte> asciiVector = Vector256.Load(pAsciiBuffer + currentOffset);

if (asciiVector.ExtractMostSignificantBits() != 0)
stephentoub (Member) commented Sep 8, 2022:

https://github.com/dotnet/runtime/pull/73055/files?diff=split&w=1#diff-66bbe89271f826c9232bd146abb678844754515dc027f70ad0ce36f751da46ebR1378 suggests that Sse41.TestZ is faster than ExtractMostSignificantBits for 128 bits. Does the same not hold for Avx.TestZ for 256 bits?

adamsitnik (Member, Author):

It does not:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Reflection.Metadata;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

namespace VectorBenchmarks
{
    internal class Program
    {
        static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }

    public unsafe class ContainsNonAscii
    {
        private const int Size = 1024;
        private byte* _bytes;
            
        [GlobalSetup]
        public void Setup()
        {
            _bytes = (byte*)NativeMemory.AlignedAlloc(new UIntPtr(Size), new UIntPtr(32));
            new Span<byte>(_bytes, Size).Clear();
        }

        [GlobalCleanup]
        public void Free() => NativeMemory.AlignedFree(_bytes);

        [Benchmark]
        public bool ExtractMostSignificantBits()
        {
            ref byte searchSpace = ref *_bytes;
            nuint currentOffset = 0;
            nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

            do
            {
                Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

                if (asciiVector.ExtractMostSignificantBits() != 0)
                {
                    return true;
                }

                currentOffset += (nuint)Vector256<byte>.Count;
            } while (currentOffset <= finalOffsetWhereCanRunLoop);

            return false;
        }

        [Benchmark]
        public bool TestZ()
        {
            ref byte searchSpace = ref *_bytes;
            nuint currentOffset = 0;
            nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

            do
            {
                Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

                if (!Avx.TestZ(asciiVector, Vector256.Create((byte)0x80)))
                {
                    return true;
                }

                currentOffset += (nuint)Vector256<byte>.Count;
            } while (currentOffset <= finalOffsetWhereCanRunLoop);

            return false;
        }
    }
}
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rc.1.22423.16
  [Host]     : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev | Code Size |
|--------|------|-------|--------|-----------|
| ExtractMostSignificantBits | 11.78 ns | 0.042 ns | 0.040 ns | 57 B |
| TestZ | 14.76 ns | 0.320 ns | 0.416 ns | 68 B |

.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2

; VectorBenchmarks.ContainsNonAscii.ExtractMostSignificantBits()
       vzeroupper
       mov       rax,[rcx+8]
       xor       edx,edx
       nop       dword ptr [rax]
M00_L00:
       vmovdqu   ymm0,ymmword ptr [rax+rdx]
       vpmovmskb ecx,ymm0
       test      ecx,ecx
       jne       short M00_L01
       add       rdx,20
       cmp       rdx,3E0
       jbe       short M00_L00
       xor       eax,eax
       vzeroupper
       ret
M00_L01:
       mov       eax,1
       vzeroupper
       ret
; Total bytes of code 57

.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2

; VectorBenchmarks.ContainsNonAscii.TestZ()
       vzeroupper
       mov       rax,[rcx+8]
       xor       edx,edx
       vmovupd   ymm0,[7FF9E3D94D60]
       nop       dword ptr [rax]
       nop       dword ptr [rax]
M00_L00:
       vptest    ymm0,ymmword ptr [rax+rdx]
       jne       short M00_L01
       add       rdx,20
       cmp       rdx,3E0
       jbe       short M00_L00
       xor       eax,eax
       vzeroupper
       ret
M00_L01:
       mov       eax,1
       vzeroupper
       ret
; Total bytes of code 68

stephentoub (Member):

> It does not:
That's not what I see on my machine.

| Method | Mean |
|--------|------|
| ExtractMostSignificantBits_128 | 31.77 ns |
| TestZ_128 | 25.58 ns |
| ExtractMostSignificantBits_256 | 15.58 ns |
| TestZ_256 | 11.66 ns |
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics;

[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public unsafe partial class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private const int Size = 1024;
    private byte* _bytes;

    [GlobalSetup]
    public void Setup()
    {
        _bytes = (byte*)NativeMemory.AlignedAlloc(new UIntPtr(Size), new UIntPtr(32));
        new Span<byte>(_bytes, Size).Clear();
    }

    [GlobalCleanup]
    public void Free() => NativeMemory.AlignedFree(_bytes);

    [Benchmark]
    public bool ExtractMostSignificantBits_128()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector128<byte>.Count;

        do
        {
            Vector128<byte> asciiVector = Vector128.LoadUnsafe(ref searchSpace, currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                return true;
            }

            currentOffset += (nuint)Vector128<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }

    [Benchmark]
    public bool TestZ_128()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector128<byte>.Count;

        do
        {
            Vector128<byte> asciiVector = Vector128.LoadUnsafe(ref searchSpace, currentOffset);

            if (!Sse41.TestZ(asciiVector, Vector128.Create((byte)0x80)))
            {
                return true;
            }

            currentOffset += (nuint)Vector128<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }

    [Benchmark]
    public bool ExtractMostSignificantBits_256()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                return true;
            }

            currentOffset += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }

    [Benchmark]
    public bool TestZ_256()
    {
        ref byte searchSpace = ref *_bytes;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref searchSpace, currentOffset);

            if (!Avx.TestZ(asciiVector, Vector256.Create((byte)0x80)))
            {
                return true;
            }

            currentOffset += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);

        return false;
    }
}

EgorBo (Member) commented Sep 9, 2022:

@stephentoub I think it depends on the CPU; I even had to revert TestZ from Vector.Equals because it produced regressions: #67902

stephentoub (Member) commented Sep 9, 2022:

> I think it depends on the CPU; I even had to revert TestZ from Vector.Equals because it produced regressions: #67902

That change reverted it from both the 256-bit and 128-bit code paths. This PR uses TestZ for 128-bit. Why is that ok?

I'm questioning the non-symmetrical usage.

Member:

There are some helpful comments in this SO thread (and related ones): https://stackoverflow.com/questions/60446759/sse2-test-xmm-bitmask-directly-without-using-pmovmskb

stephentoub (Member) commented Sep 9, 2022:

Thanks, but I'm not seeing the answer there to my question.

I'll restate:
This PR is adding additional code to prefer using TestZ with Vector128:
https://github.com/dotnet/runtime/pull/73055/files#diff-66bbe89271f826c9232bd146abb678844754515dc027f70ad0ce36f751da46ebR1379-R1391
Your #67902 reverted other changes that preferred using TestZ, not just on 256-bit but also on 128-bit vectors.
Does it still make sense for this PR to be adding additional code to use TestZ with Vector128?

(Part of why I'm pushing on this is with a goal of avoiding needing to drop down to direct instrinsics as much as possible. I'd hope we can get to a point where the obvious code to write is the best code to write in as many situations as possible.)

Member:

Right, I am not saying the current non-symmetrical usage is correct; I'd probably change both to use ExtractMostSignificantBits.

C++ compilers also do different things here, e.g. LLVM folds even direct MoveMask usage to testz: https://godbolt.org/z/MobvxvzGK

tannergooding (Member) commented Sep 9, 2022:

What is "better" is going to depend on a few factors.

On x86/x64, ExtractMostSignificantBits is likely faster because it always maps exactly to movmsk, and there are some CPUs where TestZ can be slower, particularly for "small inputs" where the match comes sooner. When the match comes later, TestZ typically wins out regardless.

On Arm64, doing the early comparison against == Zero is likely better because it is a single instruction vs the multi-instruction sequence required to emulate x64's movmsk.

I think the best choice here is to use == Zero (and therefore TestZ) as I believe it will, on average, produce the best/most consistent code. The cases where it might be a bit slower will typically be for smaller inputs where we're already returning quickly and the extra couple nanoseconds won't really matter.
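A minimal sketch of that shape (hypothetical helper name; the point is that the source expresses intent and the JIT picks the per-ISA lowering, e.g. ptest-style on x64 or a compare-and-reduce sequence on Arm64):

```csharp
using System.Runtime.Intrinsics;

internal static class NonAsciiCheckSketch
{
    internal static bool ContainsNonAsciiBytes(Vector128<byte> vector)
    {
        // Keep only the high bit of each byte; if anything survives, at least one
        // byte is >= 0x80, i.e. non-ASCII. Comparing against Zero leaves the choice
        // of instruction to the JIT instead of hardcoding a platform intrinsic.
        return (vector & Vector128.Create((byte)0x80)) != Vector128<byte>.Zero;
    }
}
```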


currentOffset += (nuint)Vector256<byte>.Count;
pCurrentWriteAddress += (nuint)Vector256<byte>.Count;
} while (currentOffset <= finalOffsetWhereCanRunLoop);
Member:

On every iteration of the loop we're bumping currentOffset and then also adding that to pAsciiBuffer. Would it be faster to instead compute the upper bound as an address, just bump the current pointer in the loop, and then after the loop compute the currentOffset if needed based on the ending/starting difference?

adamsitnik (Member, Author):

It produces better codegen, but the reported time difference is within the range of error (a 0-0.3 ns gain or loss):

public unsafe class Widen
{
    private const int Size = 1024;
    private byte* _bytes;
    private char* _chars;

    [GlobalSetup]
    public void Setup()
    {
        _bytes = (byte*)NativeMemory.AlignedAlloc(new UIntPtr(Size), new UIntPtr(32));
        new Span<byte>(_bytes, Size).Fill((byte)'a');
        _chars = (char*)NativeMemory.AlignedAlloc(new UIntPtr(Size * sizeof(char)), new UIntPtr(32));
    }

    [GlobalCleanup]
    public void Free()
    {
        NativeMemory.AlignedFree(_bytes);
        NativeMemory.AlignedFree(_chars);
    }

    [Benchmark]
    public void Current()
    {
        ref byte searchSpace = ref *_bytes;
        ushort* pCurrentWriteAddress = (ushort*)_chars;
        nuint currentOffset = 0;
        nuint finalOffsetWhereCanRunLoop = Size - (uint)Vector256<byte>.Count;

        do
        {
            Vector256<byte> asciiVector = Vector256.Load(_bytes + currentOffset);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                break;
            }

            (Vector256<ushort> low, Vector256<ushort> upper) = Vector256.Widen(asciiVector);
            low.Store(pCurrentWriteAddress);
            upper.Store(pCurrentWriteAddress + Vector256<ushort>.Count);

            currentOffset += (nuint)Vector256<byte>.Count;
            pCurrentWriteAddress += (nuint)Vector256<byte>.Count;
        } while (currentOffset <= finalOffsetWhereCanRunLoop);
    }

    [Benchmark]
    public void Suggested()
    {
        ref byte currentSearchSpace = ref *_bytes;
        ref ushort currentWriteAddress = ref Unsafe.As<char, ushort>(ref *_chars);
        ref byte oneVectorAwayFromEnd = ref Unsafe.Add(ref currentSearchSpace, Size - Vector256<byte>.Count);

        do
        {
            Vector256<byte> asciiVector = Vector256.LoadUnsafe(ref currentSearchSpace);

            if (asciiVector.ExtractMostSignificantBits() != 0)
            {
                break;
            }

            (Vector256<ushort> low, Vector256<ushort> upper) = Vector256.Widen(asciiVector);
            low.StoreUnsafe(ref currentWriteAddress);
            upper.StoreUnsafe(ref currentWriteAddress, (nuint)Vector256<ushort>.Count);

            currentSearchSpace = ref Unsafe.Add(ref currentSearchSpace, Vector256<byte>.Count);
            currentWriteAddress = ref Unsafe.Add(ref currentWriteAddress, Vector256<byte>.Count);
        } while (!Unsafe.IsAddressGreaterThan(ref currentSearchSpace, ref oneVectorAwayFromEnd));
    }
}
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rc.1.22423.16
  [Host]     : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2
| Method | Mean | Error | StdDev | Code Size |
|--------|------|-------|--------|-----------|
| Current | 44.29 ns | 0.171 ns | 0.152 ns | 81 B |
| Suggested | 44.54 ns | 0.042 ns | 0.032 ns | 77 B |

.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2

; VectorBenchmarks.Widen.Current()
       vzeroupper
       mov       eax,[rcx+8]
       mov       rax,[rcx+10]
       xor       edx,edx
M00_L00:
       mov       r8,[rcx+8]
       vmovdqu   ymm0,ymmword ptr [r8+rdx]
       vpmovmskb r8d,ymm0
       test      r8d,r8d
       jne       short M00_L01
       vmovaps   ymm1,ymm0
       vpmovzxbw ymm1,xmm1
       vextractf128 xmm0,ymm0,1
       vpmovzxbw ymm0,xmm0
       vmovdqu   ymmword ptr [rax],ymm1
       vmovdqu   ymmword ptr [rax+20],ymm0
       add       rdx,20
       add       rax,40
       cmp       rdx,3E0
       jbe       short M00_L00
M00_L01:
       vzeroupper
       ret
; Total bytes of code 81

.NET 7.0.0 (7.0.22.42223), X64 RyuJIT AVX2

; VectorBenchmarks.Widen.Suggested()
       vzeroupper
       mov       rax,[rcx+8]
       mov       rdx,[rcx+10]
       lea       rcx,[rax+3E0]
M00_L00:
       vmovdqu   ymm0,ymmword ptr [rax]
       vpmovmskb r8d,ymm0
       test      r8d,r8d
       jne       short M00_L01
       vmovaps   ymm1,ymm0
       vpmovzxbw ymm1,xmm1
       vextractf128 xmm0,ymm0,1
       vpmovzxbw ymm0,xmm0
       vmovdqu   ymmword ptr [rdx],ymm1
       vmovdqu   ymmword ptr [rdx+20],ymm0
       add       rax,20
       add       rdx,40
       cmp       rax,rcx
       jbe       short M00_L00
M00_L01:
       vzeroupper
       ret
; Total bytes of code 77

If you don't mind, I am going to merge it as-is and apply your suggestion in my next PR.

break;
}

// Vector128.Widen is not used here as it is less performant on ARM64
Member:

Do we know why? Naively I'd expect the JIT to be able to produce the same code for both.

adamsitnik (Member, Author):

I am sorry, but I don't. It's not that I didn't try to find out; it's the arm64 tooling that makes it hard for me.

Member:

@tannergooding?

If this is by design, ok. But if it's something we can/should be fixing in the JIT, I want to make sure we're not sweeping such issues under the rug. Ideally the obvious code is also the best performing code.

Member:

Codegen looks good to me:

[screenshot of the generated Arm64 assembly]

(add could be contained, but it's unrelated here)

tannergooding (Member) commented Sep 9, 2022:

I'd need a comparison between the two code paths to see where the difference is.

I would expect these to be identical except for a case where the original code was making some assumption (based on knowing the inputs were restricted to a subset of all possible values) and therefore skipping an otherwise "required" step that would be necessary to ensure deterministic results for "any input".


currentOffset += (nuint)Vector128<byte>.Count;
pCurrentWriteAddress += (nuint)Vector128<byte>.Count;
} while (currentOffset <= finalOffsetWhereCanRunLoop);
}
}
else if (Vector.IsHardwareAccelerated)
stephentoub (Member) commented Sep 8, 2022:

Why is the Vector<T> path still needed?

adamsitnik (Member, Author):

> Why is the Vector<T> path still needed?

Some Mono variants don't support Vector128 for all configs yet

stephentoub (Member):

> Some Mono variants don't support Vector128 for all configs yet

Which ones support Vector<T> but not Vector128?

Just the presence of these paths is keeping the methods from being R2R'd, it seems.

Member:

Mono-LLVM supports both; Mono without LLVM (e.g. the default Mono JIT mode, or AOT) supports only Vector<T>.

stephentoub (Member):

> Mono-LLVM supports both; Mono without LLVM (e.g. the default Mono JIT mode, or AOT) supports only Vector<T>.

Is that getting fixed?

We now have multiple vectorized implementations that don't have a Vector<T> code path. Why is this one special that it still needs one?
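For reference, a hedged skeleton of the tiered dispatch being discussed (illustrative names, not the PR's exact code). Mono without LLVM reports Vector128.IsHardwareAccelerated as false but still accelerates Vector<T>, so only the last branch applies there:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

internal static unsafe class DispatchSketch
{
    internal static void WidenDispatch(byte* pAsciiBuffer, char* pUtf16Buffer, nuint elementCount)
    {
        if (Vector256.IsHardwareAccelerated && elementCount >= 2 * (uint)Vector256<byte>.Count)
        {
            // Vector256 code path (e.g. x64 with AVX2)
        }
        else if (Vector128.IsHardwareAccelerated && elementCount >= 2 * (uint)Vector128<byte>.Count)
        {
            // Vector128 code path (x64, Arm64)
        }
        else if (Vector.IsHardwareAccelerated)
        {
            // Vector<T> fallback, kept for Mono configurations without Vector128 support
        }
        // any remaining tail is handled by the scalar loop
    }
}
```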

ghost locked this conversation as resolved and limited it to collaborators on Oct 15, 2022.