
Vectorize Enumerable.Range initialization, take 2 #87992

Merged — 3 commits merged into dotnet:main on Jul 21, 2023

Conversation

neon-sunset
Contributor

Previous attempt: #80633

This one fixes the regression on x86_64 by avoiding an SDIV and producing a smaller method body.

Diffs:

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jun 24, 2023
@ghost

ghost commented Jun 24, 2023

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: neon-sunset
Labels: area-System.Linq, community-contribution
Milestone: -

goto Scalar;
}

int stride = Vector256<int>.Count;
Member

Any reason to use V256 instead of V128 to support both x86 and arm?

Contributor Author

Please check the Aarch64 diff in the description. The loop is written in such a way that Vector256 codegen looks exactly like Vector128x2, because 2x unrolling seems to be advantageous in case of NEON, and LLVM does it for Rust's (0..n).collect() (two adds + stp).

This is a bit unconventional compared to all other SIMD code in CoreLib, but I wanted to try to keep the size of the code to a minimum because this is System.Linq.
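The approach described above — a single `Vector256<int>` loop that the JIT can lower to two 128-bit adds plus a paired store on NEON — could look roughly like the following sketch. This is an illustration under stated assumptions, not the merged implementation; the `RangeFill` name and the exact guard are hypothetical, and the `Vector128.IsHardwareAccelerated` check also covers the "no guards for SIMD" concern raised below.

```csharp
using System;
using System.Runtime.Intrinsics;

static class RangeFill
{
    public static void Fill(Span<int> destination, int value)
    {
        int i = 0;
        // Guard: fall through to the scalar loop on arm32, linux-x86,
        // mono-interp, HWINTRINSIC=0, or when the span is too short.
        if (Vector128.IsHardwareAccelerated && destination.Length >= Vector256<int>.Count)
        {
            // Start value plus lane offsets 0..7.
            Vector256<int> current = Vector256.Create(value) +
                Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
            Vector256<int> increment = Vector256.Create(Vector256<int>.Count);

            int stride = Vector256<int>.Count;
            for (; i <= destination.Length - stride; i += stride)
            {
                current.CopyTo(destination.Slice(i));
                current += increment;
            }
            value += i; // advance the scalar start past the vectorized prefix
        }

        // Scalar tail for the remaining elements.
        for (; i < destination.Length; i++, value++)
        {
            destination[i] = value;
        }
    }
}
```

Note the loop bound `i <= destination.Length - stride`: it needs only a subtraction, so no remainder computation appears anywhere.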

Member

OK, what's the codegen for e.g. SSE2 (NativeAOT's default target)?

Member

Also, I don't see any guards for SIMD at all: arm32, linux-x86, mono-interp, HWINTRINSIC=0.

Contributor Author

> OK, what's the codegen for e.g. SSE2 (NativeAOT's default target)?

Let me check. Also, is it really as low as SSE2? That would be unfortunate.

Member (@EgorBo, Jun 24, 2023)

Do we use this approach anywhere else in BCL? Last time we discussed this, V256 via V128x2 was mainly used for testing only. Does mono handle it well?

Contributor Author

How do you want the implementation to look? Full span.IndexOf style with V256 -> V128 -> Scalar blocks?

> Does mono handle it well?

No idea, do you know if mono handles it well?

Member

> How do you want the implementation to look?

I am fine with the codegen you posted - I am just not aware if this approach is recommended/used anywhere. cc @tannergooding

Member

> How do you want the implementation to look?

Has something changed since the previous PR? My feedback there was that without motivating scenarios, this wasn't worth vectorizing at all. Can you point to places where Enumerable.Range(...).ToArray/ToList is being used in production code on perf-sensitive code paths?

Contributor Author

Thank you for reviewing the change. Here are the concerns from the original PR arguing that the implementation was not worth it:

  • Negative impact on code size
  • Negative impact on maintainability
  • Negative impact on the un-vectorized path

I tried to address the first two by keeping the implementation as simple as possible (+35 lines) and evaluating the codegen size diff, which is +60B on x86_64 and +100B on Aarch64. The third is addressed by getting rid of the length % Vector<int>.Count computation, which was causing regressions on x86_64.
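For illustration, here is a minimal, hypothetical comparison of the two loop-bound styles: the earlier attempt computed an explicit `length % Vector<int>.Count` remainder (which, per the PR description, produced an SDIV on x86_64), while bounding the loop with a plain subtraction needs no remainder at all. `LoopBounds` and both helper names are invented for this sketch; both compute the same number of vectorizable elements.

```csharp
using System;
using System.Numerics;

static class LoopBounds
{
    // Style 1: compute the vectorizable prefix with an explicit remainder.
    public static int WithRemainder(int length)
        => length - (length % Vector<int>.Count);

    // Style 2: no remainder — iterate while the next full vector still fits.
    // The loop bound is just a subtraction against the stride.
    public static int WithoutRemainder(int length)
    {
        int stride = Vector<int>.Count;
        int i = 0;
        while (i <= length - stride)
        {
            i += stride; // one full vector store per iteration
        }
        return i; // elements covered by the vector loop
    }
}
```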

It is true that the pattern is most often used in tests, but I don't agree with the assertion that this makes the change useless (at some point you do start to care about test execution time, so a 2-4x speed-up for a moderately used path is nice to have). There are also references to non-test code that could benefit:

Matches: https://grep.app/search?q=Enumerable%5C.Range%5C%28%5B%5E%29%5D%2B%5C%29%5C.%28ToArray%7CToList%29%5C%28%5C%29&regexp=true&case=true&filter[lang][0]=C%23 (22 pages of results)

There is also a precedent for having SIMD code in System.Linq: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Linq/src/System/Linq/Sum.cs#L75

return list;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void Fill(Span<int> destination, int value)
Member

I don't think this should be aggressively inlined.

{
for (int i = 0; i < destination.Length; i++, value++)
// The improvements in the compiler now allow us to write the loop as Vector256.
// Manually unrolling to Vector128x2 for SSE2/NEON is no longer profitable.
Member

Can we just use Vector<T>? e.g.

```csharp
static void Fill(Span<int> destination, int value)
{
    Debug.Assert(Vector<int>.Count * sizeof(int) * 8 <= 512);

    ref int c = ref MemoryMarshal.GetReference(destination);
    ref int end = ref Unsafe.Add(ref c, destination.Length);

    if (Vector.IsHardwareAccelerated && destination.Length >= Vector<int>.Count)
    {
        var init = new Vector<int>((ReadOnlySpan<int>)new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 });
        var increment = new Vector<int>(Vector<int>.Count);
        var current = new Vector<int>(value) + init;

        ref int oneVectorFromEnd = ref Unsafe.Subtract(ref end, Vector<int>.Count);
        do
        {
            current.StoreUnsafe(ref c);
            current += increment;
            c = ref Unsafe.Add(ref c, Vector<int>.Count);
        }
        while (!Unsafe.IsAddressLessThan(ref oneVectorFromEnd, ref c));

        value = current[0];
    }

    while (Unsafe.IsAddressLessThan(ref c, ref end))
    {
        c = value++;
        c = ref Unsafe.Add(ref c, 1);
    }
}
```

Contributor Author

It is definitely an option, but I initially decided against it to get better performance on ARM64. As of now, there are pros and cons.

Vector<T> variant
Pros:

  • Takes advantage of AVX512
  • Does not rely on JIT/ILC being able to perfectly optimize V256 as V128x2 for AdvSimd/SSE2
  • Consistent with how other vectorized code is written
  • Better performance on Mono? (as @EgorBo mentioned, Mono regressions are an issue that needs to be tracked)

Cons:

  • On AVX512, does not help lengths 0..15 (e.g. Enumerable.Range(0, 10).ToArray(), which is fairly common)
  • Has reduced gains on wide ARM64 cores (M-series can dispatch a V128x2 loop iteration in a single cycle, and ARM Cortex-X3/X4 should be able to do the same). This also applies to the default NativeAOT target, which would use 128x2 with SSE2; I expect x86_64 behavior to be very similar.

Do you think Vector<T> is better for the time being?

Member

> Takes advantage of AVX512

Not by default. By default, Vector<T> will be either 128-bits or 256-bits.

> Do you think Vector<T> is better for the time being?

Yes

@neon-sunset neon-sunset force-pushed the simd-range-init branch 2 times, most recently from 342ad85 to 6165aa3 Compare July 6, 2023 12:31
@neon-sunset
Contributor Author

neon-sunset commented Jul 6, 2023

@stephentoub I have addressed the feedback but kept AggressiveInlining because Fill will get inlined into ToArray, ToList and CopyTo which are behind virtual calls and won't get inlined further. Diff example: https://www.diffchecker.com/MWZRN2Sy/

@stephentoub
Member

> which are behind virtual calls and won't get inlined further

They can be with dynamic pgo. Please remove the aggressive inlining.

@neon-sunset
Contributor Author

> They can be with dynamic pgo. Please remove the aggressive inlining.

Sorry for the confusion, I missed the attribute when force-pushing. The change is already without it, as you requested.

@stephentoub stephentoub merged commit f016dc4 into dotnet:main Jul 21, 2023
103 checks passed
@stephentoub
Member

Thanks.

@neon-sunset
Contributor Author

@stephentoub thank you for merging this and apologies for not addressing the feedback previously!
The merged change yields a 1.5-4x speed-up vs .NET 7.0 depending on the range length (and up to 10x for List, thanks to CollectionsMarshal.SetCount):
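The outsized List gains come from sizing the list once and writing through its backing span instead of calling Add per element. A minimal sketch of that idea, assuming .NET 8's CollectionsMarshal.SetCount; the `RangeToList` name and the scalar fill loop are illustrative stand-ins for the actual merged code:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class RangeToList
{
    public static List<int> ToList(int start, int count)
    {
        var list = new List<int>();
        // Grow the list to its final size in one step (new in .NET 8);
        // this avoids repeated capacity checks and per-element Add calls.
        CollectionsMarshal.SetCount(list, count);

        // Write directly into the list's backing array. A vectorized fill
        // (as discussed above) could be used here instead of this scalar loop.
        Span<int> span = CollectionsMarshal.AsSpan(list);
        for (int i = 0; i < span.Length; i++)
        {
            span[i] = start + i;
        }
        return list;
    }
}
```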

x86_64

BenchmarkDotNet v0.13.6, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-rc.1.23371.8
  [Host]        : .NET 8.0.0 (8.0.23.37103), X64 RyuJIT AVX2
  .NET 7.0      : .NET 7.0.9 (7.0.923.32018), X64 RyuJIT AVX2
  .NET 8.0      : .NET 8.0.0 (8.0.23.37103), X64 RyuJIT AVX2
  NativeAOT 8.0 : .NET 8.0.0-rc.1.23371.3, X64 NativeAOT AVX2

| Method | Job | Runtime | Length | Mean | Error | StdDev | Ratio |
|--------|-----|---------|-------:|-----:|------:|-------:|------:|
| Array | .NET 7.0 | .NET 7.0 | 10 | 14.09 ns | 0.125 ns | 0.111 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 10 | 11.89 ns | 0.124 ns | 0.104 ns | 0.84 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 11.66 ns | 0.142 ns | 0.133 ns | 0.83 |
| List | .NET 7.0 | .NET 7.0 | 10 | 22.10 ns | 0.196 ns | 0.183 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 10 | 15.41 ns | 0.161 ns | 0.151 ns | 0.70 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 20.20 ns | 0.132 ns | 0.103 ns | 0.92 |
| Array | .NET 7.0 | .NET 7.0 | 100 | 45.49 ns | 0.481 ns | 0.376 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 100 | 22.23 ns | 0.264 ns | 0.234 ns | 0.49 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 22.36 ns | 0.374 ns | 0.331 ns | 0.49 |
| List | .NET 7.0 | .NET 7.0 | 100 | 112.15 ns | 1.528 ns | 1.276 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 100 | 25.09 ns | 0.215 ns | 0.191 ns | 0.22 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 32.31 ns | 0.378 ns | 0.335 ns | 0.29 |
| Array | .NET 7.0 | .NET 7.0 | 1000 | 356.33 ns | 3.589 ns | 3.358 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 1000 | 117.84 ns | 2.250 ns | 2.105 ns | 0.33 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 1000 | 121.89 ns | 1.932 ns | 1.807 ns | 0.34 |
| List | .NET 7.0 | .NET 7.0 | 1000 | 984.15 ns | 7.978 ns | 6.662 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 1000 | 125.00 ns | 2.064 ns | 1.931 ns | 0.13 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 1000 | 133.36 ns | 1.298 ns | 1.214 ns | 0.14 |
| Array | .NET 7.0 | .NET 7.0 | 10000 | 3,209.78 ns | 21.071 ns | 19.709 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 10000 | 958.51 ns | 18.953 ns | 21.067 ns | 0.30 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 10000 | 947.25 ns | 16.773 ns | 14.007 ns | 0.29 |
| List | .NET 7.0 | .NET 7.0 | 10000 | 9,387.75 ns | 86.637 ns | 81.040 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 10000 | 965.64 ns | 11.597 ns | 10.848 ns | 0.10 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 10000 | 970.69 ns | 17.986 ns | 16.825 ns | 0.10 |

Arm64

BenchmarkDotNet v0.13.6, macOS Sonoma 14.0 (23A5286i) [Darwin 23.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-rc.1.23371.8
  [Host]        : .NET 8.0.0 (8.0.23.37103), Arm64 RyuJIT AdvSIMD
  .NET 7.0      : .NET 7.0.1 (7.0.122.56804), Arm64 RyuJIT AdvSIMD
  .NET 8.0      : .NET 8.0.0 (8.0.23.37103), Arm64 RyuJIT AdvSIMD
  NativeAOT 8.0 : .NET 8.0.0-rc.1.23371.3, Arm64 NativeAOT AdvSIMD

| Method | Job | Runtime | Length | Mean | Error | StdDev | Ratio |
|--------|-----|---------|-------:|-----:|------:|-------:|------:|
| Array | .NET 7.0 | .NET 7.0 | 10 | 39.13 ns | 0.506 ns | 0.395 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 10 | 13.30 ns | 0.018 ns | 0.016 ns | 0.34 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 16.24 ns | 0.028 ns | 0.023 ns | 0.41 |
| List | .NET 7.0 | .NET 7.0 | 10 | 26.45 ns | 0.037 ns | 0.035 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 10 | 17.99 ns | 0.031 ns | 0.029 ns | 0.68 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 22.34 ns | 0.047 ns | 0.044 ns | 0.84 |
| Array | .NET 7.0 | .NET 7.0 | 100 | 58.72 ns | 0.163 ns | 0.152 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 100 | 31.41 ns | 0.058 ns | 0.048 ns | 0.53 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 32.82 ns | 0.082 ns | 0.073 ns | 0.56 |
| List | .NET 7.0 | .NET 7.0 | 100 | 109.95 ns | 0.158 ns | 0.140 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 100 | 37.40 ns | 0.101 ns | 0.094 ns | 0.34 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 40.23 ns | 0.375 ns | 0.351 ns | 0.37 |
| Array | .NET 7.0 | .NET 7.0 | 1000 | 469.01 ns | 3.150 ns | 2.947 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 1000 | 266.25 ns | 0.453 ns | 0.424 ns | 0.57 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 1000 | 249.36 ns | 0.383 ns | 0.319 ns | 0.53 |
| List | .NET 7.0 | .NET 7.0 | 1000 | 856.08 ns | 3.760 ns | 3.140 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 1000 | 278.62 ns | 0.551 ns | 0.460 ns | 0.33 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 1000 | 259.52 ns | 0.809 ns | 0.756 ns | 0.30 |
| Array | .NET 7.0 | .NET 7.0 | 10000 | 4,111.95 ns | 68.902 ns | 64.451 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 10000 | 2,420.28 ns | 13.809 ns | 12.917 ns | 0.59 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 10000 | 2,284.52 ns | 7.150 ns | 6.338 ns | 0.56 |
| List | .NET 7.0 | .NET 7.0 | 10000 | 7,661.80 ns | 19.279 ns | 18.034 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 10000 | 2,581.48 ns | 15.425 ns | 14.429 ns | 0.34 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 10000 | 2,487.72 ns | 11.772 ns | 10.436 ns | 0.32 |

@neon-sunset neon-sunset deleted the simd-range-init branch July 23, 2023 16:07
@dotnet dotnet locked as resolved and limited conversation to collaborators Aug 22, 2023