Vectorize the CRC64 implementation #85221

brantburnett · 2023-04-23T13:36:49Z

This significantly improves performance for System.IO.Hashing.Crc64 for cases where the source span is 16 bytes or larger on Intel x86/x64 and modern ARM architectures. The vectorization change only applies to .NET 7 and later targets of System.IO.Hashing because it uses some Vector128 APIs added in .NET 7.

This is a continuation of work done in #83321 which added vectorization to CRC32.
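For context, here is a minimal sketch of how a caller might exercise both sides of the 16-byte threshold described above. It assumes the System.IO.Hashing NuGet package is referenced; the buffer size, seed, and 8-byte chunk size are arbitrary choices for illustration and do not come from the benchmark code.

```csharp
using System;
using System.IO.Hashing; // NuGet package: System.IO.Hashing

byte[] data = new byte[1024];
new Random(42).NextBytes(data); // arbitrary seed, illustration only

// One large Append: per the description, a span of 16 bytes or more
// is eligible for the vectorized path on supported hardware.
var large = new Crc64();
large.Append(data);

// Many small Appends: each 8-byte span is below the 16-byte threshold.
var small = new Crc64();
for (int offset = 0; offset < data.Length; offset += 8)
{
    small.Append(data.AsSpan(offset, 8));
}

// Chunking must not change the checksum, whichever path is taken.
ReadOnlySpan<byte> largeHash = large.GetCurrentHash();
ReadOnlySpan<byte> smallHash = small.GetCurrentHash();
bool same = largeHash.SequenceEqual(smallHash);
Console.WriteLine(same);
```

Callers that feed the hash in very small pieces may see less of the improvement, so batching input into larger spans is likely worthwhile where practical.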

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1631)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.3.23178.7
[Host] : .NET 8.0.0 (8.0.23.17408), X64 RyuJIT AVX2
Job-FPBBMO : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-FTHZKV : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Method	Job	BufferSize	Mean	Error	StdDev	Median	Min	Max	Ratio
Append	Scalar	16	30.529 ns	0.3658 ns	0.3422 ns	30.656 ns	29.968 ns	30.986 ns	1.00
Append	Vector	16	6.056 ns	0.0193 ns	0.0181 ns	6.046 ns	6.035 ns	6.089 ns	0.20

Append	Scalar	256	496.097 ns	2.8740 ns	2.5477 ns	496.725 ns	488.920 ns	499.684 ns	1.00
Append	Vector	256	14.741 ns	0.0614 ns	0.0512 ns	14.753 ns	14.589 ns	14.798 ns	0.03

Append	Scalar	1024	1,986.624 ns	11.4688 ns	10.7279 ns	1,984.088 ns	1,971.231 ns	2,004.409 ns	1.00
Append	Vector	1024	44.201 ns	0.1042 ns	0.0924 ns	44.196 ns	44.062 ns	44.413 ns	0.02

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
.NET SDK=8.0.100-preview.3.23178.7
[Host] : .NET 8.0.0 (8.0.23.17408), Arm64 RyuJIT AdvSIMD
Job-OYJLBY : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-GKZVCN : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Method	Job	BufferSize	Mean	Error	StdDev	Median	Min	Max	Ratio
Append	Scalar	16	38.44 ns	0.003 ns	0.002 ns	38.44 ns	38.43 ns	38.44 ns	1.00
Append	Vector	16	13.57 ns	0.002 ns	0.002 ns	13.57 ns	13.56 ns	13.57 ns	0.35

Append	Scalar	256	619.02 ns	0.143 ns	0.133 ns	619.00 ns	618.85 ns	619.35 ns	1.00
Append	Vector	256	28.70 ns	0.025 ns	0.021 ns	28.70 ns	28.68 ns	28.75 ns	0.05

Append	Scalar	1024	2,465.47 ns	0.183 ns	0.153 ns	2,465.44 ns	2,465.18 ns	2,465.68 ns	1.00
Append	Vector	1024	79.71 ns	0.104 ns	0.097 ns	79.68 ns	79.58 ns	79.93 ns	0.03

ghost · 2023-04-23T13:37:04Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Author:	brantburnett
Assignees:	-
Labels:	`area-System.IO`, `community-contribution`
Milestone:	-

brantburnett · 2023-04-23T18:09:03Z

/cc @tannergooding

adamsitnik

LGTM, very impressive improvements @brantburnett!

adamsitnik · 2023-05-17T15:28:20Z

src/libraries/System.IO.Hashing/src/System/IO/Hashing/Crc32.Vectorized.cs

-                Vector128<ulong> x6 = CarrylessMultiplyLower(x2, x0);
-                Vector128<ulong> x7 = CarrylessMultiplyLower(x3, x0);
-                Vector128<ulong> x8 = CarrylessMultiplyLower(x4, x0);
+                x5 = VectorHelper.CarrylessMultiplyLower(x1, x0);


It was a good idea to move these methods to another type and reuse them. 👍

To avoid repeating the type name everywhere these methods are used, you could add a using static directive at the top of the file:

using static System.IO.Hashing.VectorHelper;

adamsitnik · 2023-05-17T16:12:44Z

src/libraries/System.IO.Hashing/src/System/IO/Hashing/Crc64.Vectorized.cs

+
+            // Work with a reference to where we're at in the ReadOnlySpan and a local length
+            // to avoid extraneous range checks.
+            ref byte srcRef = ref MemoryMarshal.GetReference(source);


Personally, I would prefer to store a reference to ulong rather than byte and to update the index in every loop iteration rather than the reference. However, since a similar pattern was used in #83321 and approved by people more knowledgeable in this area, I won't suggest it.

- ref byte srcRef = ref MemoryMarshal.GetReference(source);
+ ref ulong srcRef = ref Unsafe.As<byte, ulong>(ref MemoryMarshal.GetReference(source));
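To make the difference between the two patterns concrete, here is a minimal sketch. It uses a toy XOR reduction in place of the real CRC folding, and the method names `XorByteRef` and `XorUlongRef` are hypothetical, not taken from the PR.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Pattern used in the PR: keep a ref byte and advance the reference itself.
static ulong XorByteRef(ReadOnlySpan<byte> source)
{
    ref byte srcRef = ref MemoryMarshal.GetReference(source);
    int length = source.Length;
    ulong acc = 0;
    while (length >= sizeof(ulong))
    {
        acc ^= Unsafe.ReadUnaligned<ulong>(ref srcRef);
        srcRef = ref Unsafe.Add(ref srcRef, sizeof(ulong)); // move forward 8 bytes
        length -= sizeof(ulong);
    }
    return acc;
}

// Alternative from the review: keep a fixed ref ulong and advance an index.
static ulong XorUlongRef(ReadOnlySpan<byte> source)
{
    ref ulong srcRef = ref Unsafe.As<byte, ulong>(ref MemoryMarshal.GetReference(source));
    int count = source.Length / sizeof(ulong);
    ulong acc = 0;
    for (int i = 0; i < count; i++)
    {
        // The span is not guaranteed to be 8-byte aligned, so read unaligned.
        ref ulong element = ref Unsafe.Add(ref srcRef, i);
        acc ^= Unsafe.ReadUnaligned<ulong>(ref Unsafe.As<ulong, byte>(ref element));
    }
    return acc;
}

byte[] buffer = new byte[64];
new Random(1).NextBytes(buffer); // arbitrary seed, illustration only
bool patternsAgree = XorByteRef(buffer) == XorUlongRef(buffer);
Console.WriteLine(patternsAgree);
```

Both variants read the same 8-byte words, so the choice mainly affects readability and the code the JIT generates, not the result.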

Vectorize the CRC64 implementation

3365881

ghost added the community-contribution label Apr 23, 2023

dotnet-issue-labeler bot added the area-System.IO label Apr 23, 2023

brantburnett marked this pull request as ready for review April 23, 2023 18:08

adamsitnik added the tenet-performance label May 17, 2023

adamsitnik approved these changes May 17, 2023

adamsitnik merged commit 30c1d8f into dotnet:main May 18, 2023
104 of 107 checks passed

brantburnett deleted the crc64-vector branch May 18, 2023 18:56

dotnet locked as resolved and limited conversation to collaborators Jun 17, 2023
