Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize IEnumerable<T>.Sum where possible #84519

Merged
merged 5 commits into from May 22, 2023

Conversation

brantburnett
Copy link
Contributor

@brantburnett brantburnett commented Apr 8, 2023

When performing Sum() on IEnumerable<T> of type int or long and when IEnumerable<T> is representable as a ReadOnlySpan<T> (such as arrays and List<T>) it is possible to vectorize the implementation for improved performance. The only added cost for the slow-path fallback is a length check to be sure the ReadOnlySpan<T> is at least 4 Vector<T> long.

Note that basic vectorized addition doesn't perform overflow checks, so the checks must be implemented in code. Despite this extra cost vectorization appears to be a net gain. The exception is 128-bit vectors holding 64-bit long integers, so they are excluded from this optimization.

This should also have knock-on improvements for Average() in the long case on Intel. The int case of Average() is already vectorized using a specialized approach.

Benchmarks below are on an Intel Core i7 for both int and long and on an ARM AWS Graviton2 for int. long was not tested on ARM because ARM uses 128-bit vectors.

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1485)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.2.23157.25
[Host] : .NET 8.0.0 (8.0.23.12803), X64 RyuJIT AVX2
Job-HVWQID : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-GLTODN : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Type Job Length Mean Error StdDev Median Min Max Ratio
Int Current 32 14.34 ns 0.172 ns 0.152 ns 14.32 ns 14.09 ns 14.67 ns 1.00
Int Vectorized 32 10.34 ns 0.152 ns 0.142 ns 10.35 ns 10.16 ns 10.56 ns 0.72
Int Current 128 48.33 ns 0.132 ns 0.110 ns 48.34 ns 47.99 ns 48.42 ns 1.00
Int Vectorized 128 19.28 ns 0.686 ns 0.790 ns 19.32 ns 18.19 ns 20.75 ns 0.41
Int Current 1024 341.92 ns 6.583 ns 6.465 ns 345.28 ns 329.47 ns 348.16 ns 1.00
Int Vectorized 1024 83.02 ns 0.090 ns 0.080 ns 83.01 ns 82.91 ns 83.18 ns 0.24
Long Current 32 13.93 ns 0.093 ns 0.082 ns 13.93 ns 13.68 ns 14.01 ns 1.00
Long Vectorized 32 11.22 ns 0.078 ns 0.069 ns 11.24 ns 10.99 ns 11.27 ns 0.81
Long Current 128 47.72 ns 0.142 ns 0.133 ns 47.77 ns 47.30 ns 47.82 ns 1.00
Long Vectorized 128 24.58 ns 0.165 ns 0.146 ns 24.62 ns 24.08 ns 24.67 ns 0.52
Long Current 1024 333.73 ns 0.799 ns 0.748 ns 333.88 ns 331.41 ns 334.44 ns 1.00
Long Vectorized 1024 150.21 ns 0.237 ns 0.222 ns 150.13 ns 149.89 ns 150.70 ns 0.45

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
.NET SDK=8.0.100-preview.1.23115.2
[Host] : .NET 8.0.0 (8.0.23.11008), Arm64 RyuJIT AdvSIMD
Job-MEXHPT : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-VGHVOM : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Type Job Length Mean Error StdDev Median Min Max Ratio
Int Current 32 28.12 ns 0.043 ns 0.041 ns 28.12 ns 28.08 ns 28.19 ns 1.00
Int Vectorized 32 16.93 ns 0.031 ns 0.026 ns 16.93 ns 16.89 ns 16.98 ns 0.60
Int Current 128 103.80 ns 0.023 ns 0.021 ns 103.80 ns 103.75 ns 103.83 ns 1.00
Int Vectorized 128 56.63 ns 0.072 ns 0.064 ns 56.62 ns 56.53 ns 56.77 ns 0.55
Int Current 1024 832.41 ns 0.108 ns 0.090 ns 832.37 ns 832.28 ns 832.58 ns 1.00
Int Vectorized 1024 432.59 ns 0.087 ns 0.081 ns 432.58 ns 432.46 ns 432.74 ns 0.52

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Apr 8, 2023
@ghost
Copy link

ghost commented Apr 8, 2023

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

When performing Sum() on IEnumerable<T> of type int or longand whenIEnumerableis representable as aReadOnlySpan(such as arrays andList) it is possible to vectorize the implementation for improved performance. The only added cost for the slow-path fallback is a length check to be sure the ReadOnlySpanis at least 4Vector` long.

Note that basic vectorized addition doesn't perform overflow checks, so the checks must be implemented in code. Despite this extra cost vectorization appears to be a net gain. The exception is 128-bit vectors holding 64-bit long integers, so they are excluded from this optimization.

This should also have knock-on improvements for Average() in the long case on Intel. The int case of Average() is already vectorized using a specialized approach.

Benchmarks below are on an Intel Core i7 for both int and long and on an ARM AWS Graviton2 for int. long was not tested on ARM because ARM uses 128-bit vectors.

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1485)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.2.23157.25
[Host] : .NET 8.0.0 (8.0.23.12803), X64 RyuJIT AVX2
Job-HVWQID : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-GLTODN : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Type Job Length Mean Error StdDev Median Min Max Ratio
Int Vectorized 32 14.34 ns 0.172 ns 0.152 ns 14.32 ns 14.09 ns 14.67 ns 1.00
Int Current 32 10.34 ns 0.152 ns 0.142 ns 10.35 ns 10.16 ns 10.56 ns 0.72
Int Vectorized 128 48.33 ns 0.132 ns 0.110 ns 48.34 ns 47.99 ns 48.42 ns 1.00
Int Current 128 19.28 ns 0.686 ns 0.790 ns 19.32 ns 18.19 ns 20.75 ns 0.41
Int Vectorized 1024 341.92 ns 6.583 ns 6.465 ns 345.28 ns 329.47 ns 348.16 ns 1.00
Int Current 1024 83.02 ns 0.090 ns 0.080 ns 83.01 ns 82.91 ns 83.18 ns 0.24
Long Vectorized 32 13.93 ns 0.093 ns 0.082 ns 13.93 ns 13.68 ns 14.01 ns 1.00
Long Current 32 11.22 ns 0.078 ns 0.069 ns 11.24 ns 10.99 ns 11.27 ns 0.81
Long Vectorized 128 47.72 ns 0.142 ns 0.133 ns 47.77 ns 47.30 ns 47.82 ns 1.00
Long Current 128 24.58 ns 0.165 ns 0.146 ns 24.62 ns 24.08 ns 24.67 ns 0.52
Long Vectorized 1024 333.73 ns 0.799 ns 0.748 ns 333.88 ns 331.41 ns 334.44 ns 1.00
Long Current 1024 150.21 ns 0.237 ns 0.222 ns 150.13 ns 149.89 ns 150.70 ns 0.45

BenchmarkDotNet=v0.13.2.2052-nightly, OS=ubuntu 22.04
AWS m6g.xlarge Graviton2
.NET SDK=8.0.100-preview.1.23115.2
[Host] : .NET 8.0.0 (8.0.23.11008), Arm64 RyuJIT AdvSIMD
Job-MEXHPT : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-VGHVOM : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1

Type Job Length Mean Error StdDev Median Min Max Ratio
Int Vectorized 32 28.12 ns 0.043 ns 0.041 ns 28.12 ns 28.08 ns 28.19 ns 1.00
Int Current 32 16.93 ns 0.031 ns 0.026 ns 16.93 ns 16.89 ns 16.98 ns 0.60
Int Vectorized 128 103.80 ns 0.023 ns 0.021 ns 103.80 ns 103.75 ns 103.83 ns 1.00
Int Current 128 56.63 ns 0.072 ns 0.064 ns 56.62 ns 56.53 ns 56.77 ns 0.55
Int Vectorized 1024 832.41 ns 0.108 ns 0.090 ns 832.37 ns 832.28 ns 832.58 ns 1.00
Int Current 1024 432.59 ns 0.087 ns 0.081 ns 432.58 ns 432.46 ns 432.74 ns 0.52
Author: brantburnett
Assignees: -
Labels:

area-System.Linq, community-contribution

Milestone: -

@brantburnett brantburnett marked this pull request as ready for review April 9, 2023 00:50
@KeterSCP
Copy link

KeterSCP commented Apr 9, 2023

@brantburnett according to your benchmarks table, the vectorized version is slower for all cases, this might be a mistake I suppose?

@brantburnett
Copy link
Contributor Author

@brantburnett according to your benchmarks table, the vectorized version is slower for all cases, this might be a mistake I suppose?

Yes, silly typo when I was replacing the long --corerun paths with something meaningful. Thanks for pointing it out, I've corrected it.

Copy link
Member

@eiriktsarpalis eiriktsarpalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be interesting if you could share any benchmark results showcasing the improvements of the change.

You can find details on how to benchmark private builds in this doc.

@brantburnett
Copy link
Contributor Author

It would be interesting if you could share any benchmark results showcasing the improvements of the change.

You can find details on how to benchmark private builds in this doc.

There are benchmark results for both x64 and ARM in the main PR description, are there other benchmarks you'd like to see?

@eiriktsarpalis
Copy link
Member

eiriktsarpalis commented Apr 21, 2023

There are benchmark results for both x64 and ARM in the main PR description, are there other benchmarks you'd like to see?

D'oh! Completely skipped past those :-)

Vector<T> accumulator = Vector<T>.Zero;

// Build a test vector with only the sign bit set in each element. JIT will fold this into a constant.
Vector<T> overflowTestVector = new(T.RotateRight(T.MultiplicativeIdentity, 1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we know T works with Vector, we know it must be a primitive value which is 2's complement. So we can add where T : IMinMaxValue<T> and simply use T.MinValue instead to get a mask of the sign.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 130 to 131
ptr = ref Unsafe.Add(ref ptr, Vector<T>.Count * 4);
length -= Vector<T>.Count * 4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar feedback has been given on other PRs, but we'd prefer to avoid mutating the byref where possible in favor of standard indexing and using LoadUnsafe(ref baseAddress, index) instead. You can then increment index by Vector<T>.Count in the loop.

This can result in slightly worse codegen in some cases, but it makes the code simpler and less prone to GC holes and similar bugs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and the JIT output still looks decent. It was able to encode the offset arithmetic into the operation on x64:

lea ebx, [eax+edi+0x40]

Unfortunately, it's an extra lea operation. While this did impact the benchmarks, they are still an improvement over unvectorized.

Comment on lines 164 to 168
// Add any remaining elements
for (int i = 0; i < length; i++)
{
checked { result += Unsafe.Add(ref ptr, i); }
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be handled by doing a backtrack and masking off already processed data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any suggestions on how to build a mask like that in a Vector<T>? So far the only way I've found is looping and calling WithElement, but looking at the resulting JIT of that loop I feel like just adding the elements from the span will be way better.

    L0126: vxorps ymm3, ymm3, ymm3
    L012a: vmovupd [ebp-0x2c], ymm3
    L012f: cmp eax, 8
    L0132: jae L020a
    L0138: vmovupd [ebp-0x2c], ymm2
    L013d: lea ecx, [ebp-0x2c]
    L0140: xor esi, esi
    L0142: mov [ecx+eax*4], esi
    L0145: vmovupd ymm2, [ebp-0x2c]
    L014a: inc eax
    L014b: cmp edx, eax
    L014d: jg short L0126

I considered a static ROS of zero elements followed by all bits set elements that could be indexed to find the mask. However, this is problematic because we're dealing with generic integers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I have a partial solution. Using left byte-shift operations on vectors will work to fill with zero elements, since we don't care if we move the elements being summed into different lanes. The problem is that this only works with Vector128<T> on Intel. So to use this solution we'd need to either:

A) Only implement as Vector128<T>, losing long support and running slower on Intel.
or B) Implement split Vector128<T> and Vector256<T> implementations, with lots of code duplication. We'd then only gain the backtrack advantages on short int arrays that fallback to Vector128<T>, old Intel processors, and ARM.

I don't think A is a good option. And I'm not sure B is worth it? Let me know what you think.

@eiriktsarpalis eiriktsarpalis self-assigned this May 15, 2023
@adamsitnik adamsitnik added the tenet-performance Performance related issue label May 17, 2023
@eiriktsarpalis eiriktsarpalis merged commit 07af177 into dotnet:main May 22, 2023
104 of 107 checks passed
@dotnet dotnet locked as resolved and limited conversation to collaborators Jun 22, 2023
@brantburnett brantburnett deleted the sum-vectorized branch July 21, 2023 17:34
@ericstj ericstj added the breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. label Oct 10, 2023
@ghost ghost added the needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet label Oct 10, 2023
@ericstj ericstj added this to the 8.0.0 milestone Oct 16, 2023
@eiriktsarpalis eiriktsarpalis removed the needs-breaking-change-doc-created Breaking changes need an issue opened with https://github.com/dotnet/docs/issues/new?template=dotnet label Oct 25, 2023
@eiriktsarpalis
Copy link
Member

Breaking change doc: dotnet/docs#37734

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Linq breaking-change Issue or PR that represents a breaking API or functional change over a prerelease. community-contribution Indicates that the PR has been added by a community member tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants