Vectorize the CRC64 implementation #85221

Merged: adamsitnik merged 1 commit into dotnet:main on May 18, 2023

brantburnett
Contributor

This significantly improves performance for System.IO.Hashing.Crc64 when the source span is 16 bytes or larger on Intel x86/x64 and modern ARM architectures. In the benchmarks below, the vectorized path is roughly 3x to 5x faster at 16 bytes and roughly 30x to 45x faster at 1024 bytes. The vectorization change only applies to .NET 7 and later targets of System.IO.Hashing because it uses some Vector128 APIs added in .NET 7.

This is a continuation of work done in #83321 which added vectorization to CRC32.
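
For readers unfamiliar with the API being optimized, here is a minimal usage sketch. It assumes the System.IO.Hashing NuGet package is referenced; the class name `Crc64UsageSketch`, the helper names, and the 16-byte chunk size are illustrative only. Whether a given call takes the vectorized path is an internal detail of the library, so the sketch only shows that one-shot and incremental hashing agree.

```csharp
using System;
using System.IO.Hashing; // NuGet package: System.IO.Hashing (assumed to be referenced)
using System.Text;

public static class Crc64UsageSketch
{
    // One-shot hashing of a whole buffer.
    public static byte[] HashOneShot(byte[] data) => Crc64.Hash(data);

    // Incremental hashing: feed the buffer in chunks of chunkSize bytes (must be > 0).
    public static byte[] HashInChunks(byte[] data, int chunkSize)
    {
        var crc = new Crc64();
        for (int offset = 0; offset < data.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, data.Length - offset);
            crc.Append(data.AsSpan(offset, count));
        }
        return crc.GetCurrentHash();
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("The quick brown fox jumps over the lazy dog");
        byte[] oneShot = HashOneShot(data);
        byte[] chunked = HashInChunks(data, 16);

        // Print the 8-byte hash and confirm both approaches produce the same bytes.
        Console.WriteLine(Convert.ToHexString(oneShot));
        Console.WriteLine(oneShot.AsSpan().SequenceEqual(chunked));
    }
}
```

Each `Append` call in the chunked version passes a span of at most 16 bytes, which is the minimum size the description mentions for the vectorized path.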

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1631)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.3.23178.7
[Host] : .NET 8.0.0 (8.0.23.17408), X64 RyuJIT AVX2
Job-FPBBMO : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-FTHZKV : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

| Method | Job | BufferSize | Mean | Error | StdDev | Median | Min | Max | Ratio |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Append | Scalar | 16 | 30.529 ns | 0.3658 ns | 0.3422 ns | 30.656 ns | 29.968 ns | 30.986 ns | 1.00 |
| Append | Vector | 16 | 6.056 ns | 0.0193 ns | 0.0181 ns | 6.046 ns | 6.035 ns | 6.089 ns | 0.20 |
| Append | Scalar | 256 | 496.097 ns | 2.8740 ns | 2.5477 ns | 496.725 ns | 488.920 ns | 499.684 ns | 1.00 |
| Append | Vector | 256 | 14.741 ns | 0.0614 ns | 0.0512 ns | 14.753 ns | 14.589 ns | 14.798 ns | 0.03 |
| Append | Scalar | 1024 | 1,986.624 ns | 11.4688 ns | 10.7279 ns | 1,984.088 ns | 1,971.231 ns | 2,004.409 ns | 1.00 |
| Append | Vector | 1024 | 44.201 ns | 0.1042 ns | 0.0924 ns | 44.196 ns | 44.062 ns | 44.413 ns | 0.02 |

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
.NET SDK=8.0.100-preview.3.23178.7
[Host] : .NET 8.0.0 (8.0.23.17408), Arm64 RyuJIT AdvSIMD
Job-OYJLBY : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-GKZVCN : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

| Method | Job | BufferSize | Mean | Error | StdDev | Median | Min | Max | Ratio |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Append | Scalar | 16 | 38.44 ns | 0.003 ns | 0.002 ns | 38.44 ns | 38.43 ns | 38.44 ns | 1.00 |
| Append | Vector | 16 | 13.57 ns | 0.002 ns | 0.002 ns | 13.57 ns | 13.56 ns | 13.57 ns | 0.35 |
| Append | Scalar | 256 | 619.02 ns | 0.143 ns | 0.133 ns | 619.00 ns | 618.85 ns | 619.35 ns | 1.00 |
| Append | Vector | 256 | 28.70 ns | 0.025 ns | 0.021 ns | 28.70 ns | 28.68 ns | 28.75 ns | 0.05 |
| Append | Scalar | 1024 | 2,465.47 ns | 0.183 ns | 0.153 ns | 2,465.44 ns | 2,465.18 ns | 2,465.68 ns | 1.00 |
| Append | Vector | 1024 | 79.71 ns | 0.104 ns | 0.097 ns | 79.68 ns | 79.58 ns | 79.93 ns | 0.03 |

@ghost added the community-contribution label Apr 23, 2023

ghost commented Apr 23, 2023

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: brantburnett
Assignees: -
Labels: area-System.IO, community-contribution
Milestone: -

@brantburnett marked this pull request as ready for review April 23, 2023 18:08
@brantburnett
Contributor Author

/cc @tannergooding

@adamsitnik added the tenet-performance label May 17, 2023

@adamsitnik (Member) left a comment

LGTM, very impressive improvements, @brantburnett!

Vector128<ulong> x6 = CarrylessMultiplyLower(x2, x0);
Vector128<ulong> x7 = CarrylessMultiplyLower(x3, x0);
Vector128<ulong> x8 = CarrylessMultiplyLower(x4, x0);
x5 = VectorHelper.CarrylessMultiplyLower(x1, x0);
Member

It was a good idea to move these methods to another type and reuse them. 👍

To avoid having to add the type name everywhere these methods are used, you could add a using static directive at the top of the file:

using static System.IO.Hashing.VectorHelper;
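
As a rough illustration of what the directive buys, the sketch below uses a hypothetical `Demo.BitHelper` type as a stand-in for `VectorHelper` (the real helper's members are not reproduced here). With `using static` in place, the same method can be called with or without the type name.

```csharp
using System;
using static Demo.BitHelper; // hypothetical helper type, stands in for VectorHelper

namespace Demo
{
    // Hypothetical shared helper; not part of System.IO.Hashing.
    public static class BitHelper
    {
        public static ulong RotateLeftByOne(ulong value) => (value << 1) | (value >> 63);
    }

    public static class Program
    {
        public static void Main()
        {
            // Qualified call: works with or without the using static directive.
            ulong qualified = BitHelper.RotateLeftByOne(0x8000_0000_0000_0001UL);

            // Unqualified call: resolved through the using static directive.
            ulong unqualified = RotateLeftByOne(0x8000_0000_0000_0001UL);

            Console.WriteLine(qualified == unqualified);
        }
    }
}
```

The trade-off is readability at the call site: unqualified names are shorter, but the reader has to check the directives at the top of the file to see where a method comes from.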


// Work with a reference to where we're at in the ReadOnlySpan and a local length
// to avoid extraneous range checks.
ref byte srcRef = ref MemoryMarshal.GetReference(source);
Member

Personally, I would prefer to store a reference to ulong rather than byte and, in every loop iteration, update the index rather than the reference. However, since a similar pattern was used in #83321 and approved by people more knowledgeable in this area, I won't suggest it.

- ref byte srcRef = ref MemoryMarshal.GetReference(source);
+ ref ulong srcRef = ref Unsafe.As<byte, ulong>(ref MemoryMarshal.GetReference(source));
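
The two traversal styles discussed in this comment can be sketched side by side. This is not the PR's code: the class `RefTraversalSketch` and both methods are hypothetical, and a simple XOR of 8-byte words stands in for the real CRC folding work. The first method advances a `ref byte` each iteration, as the PR does; the second keeps a `ref ulong` fixed and advances an index, as the reviewer describes.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static class RefTraversalSketch
{
    // Style used in the PR: hold a ref byte and move the reference forward.
    public static ulong XorByAdvancingRef(ReadOnlySpan<byte> source)
    {
        ref byte srcRef = ref MemoryMarshal.GetReference(source);
        int length = source.Length;
        ulong acc = 0;
        while (length >= sizeof(ulong))
        {
            acc ^= Unsafe.ReadUnaligned<ulong>(ref srcRef);
            srcRef = ref Unsafe.Add(ref srcRef, sizeof(ulong));
            length -= sizeof(ulong);
        }
        return acc;
    }

    // Reviewer's preference: hold a ref ulong and move an element index instead.
    public static ulong XorByAdvancingIndex(ReadOnlySpan<byte> source)
    {
        ref ulong srcRef = ref Unsafe.As<byte, ulong>(ref MemoryMarshal.GetReference(source));
        int count = source.Length / sizeof(ulong);
        ulong acc = 0;
        for (int i = 0; i < count; i++)
        {
            // Read through a byte reference so unaligned data is handled safely.
            ref ulong element = ref Unsafe.Add(ref srcRef, i);
            acc ^= Unsafe.ReadUnaligned<ulong>(ref Unsafe.As<ulong, byte>(ref element));
        }
        return acc;
    }

    public static void Main()
    {
        byte[] data = new byte[40];
        for (int i = 0; i < data.Length; i++) data[i] = (byte)(i * 7 + 1);

        // Both traversals visit the same 8-byte words, so the results match.
        Console.WriteLine(XorByAdvancingRef(data) == XorByAdvancingIndex(data));
    }
}
```

Both versions avoid span bounds checks, so the choice between them is mostly one of style and of what the JIT generates for each loop.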

@adamsitnik merged commit 30c1d8f into dotnet:main May 18, 2023
104 of 107 checks passed
@brantburnett deleted the crc64-vector branch May 18, 2023 18:56
@dotnet locked as resolved and limited conversation to collaborators Jun 17, 2023