
Vectorize Enumerable.Range initialization, take 2 #87992

Merged — 3 commits merged into dotnet:main on Jul 21, 2023

Conversation

neon-sunset
Contributor

Previous attempt: #80633

This one fixes the regression on x86_64 by avoiding an SDIV and producing a smaller method body.

Diffs:

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Jun 24, 2023
@ghost

ghost commented Jun 24, 2023

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: neon-sunset
Labels: area-System.Linq, community-contribution
Milestone: -

goto Scalar;
}

int stride = Vector256<int>.Count;
Member

Any reason to use V256 instead of V128 to support both x86 and arm?

Contributor Author

Please check the Aarch64 diff in the description. The loop is written in such a way that Vector256 codegen looks exactly like Vector128x2, because 2x unrolling seems to be advantageous in case of NEON, and LLVM does it for Rust's (0..n).collect() (two adds + stp).

This is a bit unconventional compared to all other SIMD code in CoreLib, but I wanted to try to keep the size of the code to a minimum because this is System.Linq.
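The approach described above — a single `Vector256<int>` loop that the JIT can lower to two 128-bit adds plus a paired store on NEON — could look roughly like the following sketch. This is an illustration under stated assumptions, not the merged implementation; the `RangeFill` name and the exact guard are hypothetical, and the `Vector128.IsHardwareAccelerated` check also covers the "no guards for SIMD" concern raised below.

```csharp
using System;
using System.Runtime.Intrinsics;

static class RangeFill
{
    public static void Fill(Span<int> destination, int value)
    {
        int i = 0;
        // Guard: fall through to the scalar loop on arm32, linux-x86,
        // mono-interp, HWINTRINSIC=0, or when the span is too short.
        if (Vector128.IsHardwareAccelerated && destination.Length >= Vector256<int>.Count)
        {
            // Start value plus lane offsets 0..7.
            Vector256<int> current = Vector256.Create(value) +
                Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
            Vector256<int> increment = Vector256.Create(Vector256<int>.Count);

            int stride = Vector256<int>.Count;
            for (; i <= destination.Length - stride; i += stride)
            {
                current.CopyTo(destination.Slice(i));
                current += increment;
            }
            value += i; // advance the scalar start past the vectorized prefix
        }

        // Scalar tail for the remaining elements.
        for (; i < destination.Length; i++, value++)
        {
            destination[i] = value;
        }
    }
}
```

Note the loop bound `i <= destination.Length - stride`: it needs only a subtraction, so no remainder computation appears anywhere.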

Member

OK, what's the codegen for e.g. SSE2 (NativeAOT's default target)?

Member

Also, I don't see any guards for SIMD at all: arm32, linux-x86, mono-interp, HWINTRINSIC=0.

Contributor Author

> OK, what's the codegen for e.g. SSE2 (NativeAOT's default target)?

Let me check. Also, is it really as low as SSE2? That would be unfortunate.

Member (@EgorBo, Jun 24, 2023)

Do we use this approach anywhere else in BCL? Last time we discussed this, V256 via V128x2 was mainly used for testing only. Does mono handle it well?

Contributor Author

How do you want the implementation to look? Full span.IndexOf style with V256 -> V128 -> Scalar blocks?

> Does mono handle it well?

No idea, do you know if mono handles it well?

Member

> How do you want the implementation to look?

I am fine with the codegen you posted - I am just not aware if this approach is recommended/used anywhere. cc @tannergooding

Member

> How do you want the implementation to look?

Has something changed since the previous PR? My feedback there was that without motivating scenarios, this wasn't worth vectorizing at all. Can you point to places where Enumerable.Range(...).ToArray/ToList is being used in production code on perf-sensitive code paths?

Contributor Author

Thank you for reviewing the change. Here are the concerns from the original PR arguing that the implementation was not worth it:

  • Negative impact on code size
  • Negative impact on maintainability
  • Negative impact on the un-vectorized path

I tried to address the first two by keeping the implementation as simple as possible (+35 lines) and evaluating the codegen size diff, which is +60B on x86_64 and +100B on Aarch64. The third is addressed by getting rid of the length % Vector<int>.Count computation, which was causing regressions on x86_64.
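For illustration, here is a minimal, hypothetical comparison of the two loop-bound styles: the earlier attempt computed an explicit `length % Vector<int>.Count` remainder (which, per the PR description, produced an SDIV on x86_64), while bounding the loop with a plain subtraction needs no remainder at all. `LoopBounds` and both helper names are invented for this sketch; both compute the same number of vectorizable elements.

```csharp
using System;
using System.Numerics;

static class LoopBounds
{
    // Style 1: compute the vectorizable prefix with an explicit remainder.
    public static int WithRemainder(int length)
        => length - (length % Vector<int>.Count);

    // Style 2: no remainder — iterate while the next full vector still fits.
    // The loop bound is just a subtraction against the stride.
    public static int WithoutRemainder(int length)
    {
        int stride = Vector<int>.Count;
        int i = 0;
        while (i <= length - stride)
        {
            i += stride; // one full vector store per iteration
        }
        return i; // elements covered by the vector loop
    }
}
```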

It is true that the pattern is most often used in tests, but I don't agree with the assertion that this makes the change useless (at some point you do start to care about test execution time, so a 2-4x speed-up for a moderately used path is nice to have). There are also references to non-test code that could benefit:

Matches: https://grep.app/search?q=Enumerable%5C.Range%5C%28%5B%5E%29%5D%2B%5C%29%5C.%28ToArray%7CToList%29%5C%28%5C%29&regexp=true&case=true&filter[lang][0]=C%23 (22 pages of results)

There is also a precedent for having SIMD code in System.Linq: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Linq/src/System/Linq/Sum.cs#L75

return list;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void Fill(Span<int> destination, int value)
Member

I don't think this should be aggressively inlined.

{
for (int i = 0; i < destination.Length; i++, value++)
// The improvements in the compiler now allow us to write the loop as Vector256.
// Manually unrolling to Vector128x2 for SSE2/NEON is no longer profitable.
Member

Can we just use Vector<T>? e.g.

```csharp
static void Fill(Span<int> destination, int value)
{
    Debug.Assert(Vector<int>.Count * sizeof(int) * 8 <= 512);

    ref int c = ref MemoryMarshal.GetReference(destination);
    ref int end = ref Unsafe.Add(ref c, destination.Length);

    if (Vector.IsHardwareAccelerated && destination.Length >= Vector<int>.Count)
    {
        var init = new Vector<int>((ReadOnlySpan<int>)new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 });
        var increment = new Vector<int>(Vector<int>.Count);
        var current = new Vector<int>(value) + init;

        ref int oneVectorFromEnd = ref Unsafe.Subtract(ref end, Vector<int>.Count);
        do
        {
            current.StoreUnsafe(ref c);
            current += increment;
            c = ref Unsafe.Add(ref c, Vector<int>.Count);
        }
        while (!Unsafe.IsAddressLessThan(ref oneVectorFromEnd, ref c));

        value = current[0];
    }

    while (Unsafe.IsAddressLessThan(ref c, ref end))
    {
        c = value++;
        c = ref Unsafe.Add(ref c, 1);
    }
}
```

Contributor Author

It is definitely an option, but I initially decided against it to get better performance on ARM64. As of now, there are pros and cons.

Vector<T> variant
Pros:

  • Takes advantage of AVX512
  • Does not rely on JIT/ILC being able to perfectly optimize V256 as V128x2 for AdvSimd/SSE2
  • Consistent with how other vectorized code is written
  • Better performance on Mono? (as @EgorBo mentioned, Mono regressions are an issue that needs to be tracked)

Cons:

  • On AVX512, does not help lengths 0..15 (e.g. Enumerable.Range(0, 10).ToArray(), which is fairly common)
  • Has reduced gains on wide ARM64 cores (M-series can dispatch a V128x2 loop iteration in a single cycle, and ARM Cortex-X3/X4 should be able to do the same). This also applies to the default NativeAOT target, which would use 128x2 with SSE2; I expect x86_64 behavior to be very similar.

Do you think Vector<T> is better for the time being?

Member

> Takes advantage of AVX512

Not by default. By default, Vector<T> will be either 128-bits or 256-bits.

> Do you think Vector<T> is better for the time being?

Yes

@neon-sunset neon-sunset force-pushed the simd-range-init branch 2 times, most recently from 342ad85 to 6165aa3 Compare July 6, 2023 12:31
@neon-sunset
Contributor Author

neon-sunset commented Jul 6, 2023

@stephentoub I have addressed the feedback but kept AggressiveInlining because Fill will get inlined into ToArray, ToList and CopyTo which are behind virtual calls and won't get inlined further. Diff example: https://www.diffchecker.com/MWZRN2Sy/

@stephentoub
Member

> which are behind virtual calls and won't get inlined further

They can be with dynamic pgo. Please remove the aggressive inlining.

@neon-sunset
Contributor Author

> They can be with dynamic pgo. Please remove the aggressive inlining.

Sorry for the confusion, I missed the attribute when force-pushing. The change is already without it, as you requested.

@stephentoub stephentoub merged commit f016dc4 into dotnet:main Jul 21, 2023
103 checks passed
@stephentoub
Member

Thanks.

@neon-sunset
Contributor Author

@stephentoub thank you for merging this and apologies for not addressing the feedback previously!
The merged change yields a 1.5-4x speed-up vs .NET 7.0 depending on the range length (and up to 10x for List, thanks to CollectionsMarshal.SetCount):
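The outsized List gains come from sizing the list once and writing through its backing span instead of calling Add per element. A minimal sketch of that idea, assuming .NET 8's CollectionsMarshal.SetCount; the `RangeToList` name and the scalar fill loop are illustrative stand-ins for the actual merged code:

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class RangeToList
{
    public static List<int> ToList(int start, int count)
    {
        var list = new List<int>();
        // Grow the list to its final size in one step (new in .NET 8);
        // this avoids repeated capacity checks and per-element Add calls.
        CollectionsMarshal.SetCount(list, count);

        // Write directly into the list's backing array. A vectorized fill
        // (as discussed above) could be used here instead of this scalar loop.
        Span<int> span = CollectionsMarshal.AsSpan(list);
        for (int i = 0; i < span.Length; i++)
        {
            span[i] = start + i;
        }
        return list;
    }
}
```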

x86_64

BenchmarkDotNet v0.13.6, Windows 11 (10.0.22621.1992/22H2/2022Update/SunValley2)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.100-rc.1.23371.8
  [Host]        : .NET 8.0.0 (8.0.23.37103), X64 RyuJIT AVX2
  .NET 7.0      : .NET 7.0.9 (7.0.923.32018), X64 RyuJIT AVX2
  .NET 8.0      : .NET 8.0.0 (8.0.23.37103), X64 RyuJIT AVX2
  NativeAOT 8.0 : .NET 8.0.0-rc.1.23371.3, X64 NativeAOT AVX2

| Method | Job | Runtime | Length | Mean | Error | StdDev | Ratio |
|--------|-----|---------|-------:|-----:|------:|-------:|------:|
| Array | .NET 7.0 | .NET 7.0 | 10 | 14.09 ns | 0.125 ns | 0.111 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 10 | 11.89 ns | 0.124 ns | 0.104 ns | 0.84 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 11.66 ns | 0.142 ns | 0.133 ns | 0.83 |
| List | .NET 7.0 | .NET 7.0 | 10 | 22.10 ns | 0.196 ns | 0.183 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 10 | 15.41 ns | 0.161 ns | 0.151 ns | 0.70 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 20.20 ns | 0.132 ns | 0.103 ns | 0.92 |
| Array | .NET 7.0 | .NET 7.0 | 100 | 45.49 ns | 0.481 ns | 0.376 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 100 | 22.23 ns | 0.264 ns | 0.234 ns | 0.49 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 22.36 ns | 0.374 ns | 0.331 ns | 0.49 |
| List | .NET 7.0 | .NET 7.0 | 100 | 112.15 ns | 1.528 ns | 1.276 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 100 | 25.09 ns | 0.215 ns | 0.191 ns | 0.22 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 32.31 ns | 0.378 ns | 0.335 ns | 0.29 |
| Array | .NET 7.0 | .NET 7.0 | 1000 | 356.33 ns | 3.589 ns | 3.358 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 1000 | 117.84 ns | 2.250 ns | 2.105 ns | 0.33 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 1000 | 121.89 ns | 1.932 ns | 1.807 ns | 0.34 |
| List | .NET 7.0 | .NET 7.0 | 1000 | 984.15 ns | 7.978 ns | 6.662 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 1000 | 125.00 ns | 2.064 ns | 1.931 ns | 0.13 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 1000 | 133.36 ns | 1.298 ns | 1.214 ns | 0.14 |
| Array | .NET 7.0 | .NET 7.0 | 10000 | 3,209.78 ns | 21.071 ns | 19.709 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 10000 | 958.51 ns | 18.953 ns | 21.067 ns | 0.30 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 10000 | 947.25 ns | 16.773 ns | 14.007 ns | 0.29 |
| List | .NET 7.0 | .NET 7.0 | 10000 | 9,387.75 ns | 86.637 ns | 81.040 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 10000 | 965.64 ns | 11.597 ns | 10.848 ns | 0.10 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 10000 | 970.69 ns | 17.986 ns | 16.825 ns | 0.10 |

Arm64

BenchmarkDotNet v0.13.6, macOS Sonoma 14.0 (23A5286i) [Darwin 23.0.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.100-rc.1.23371.8
  [Host]        : .NET 8.0.0 (8.0.23.37103), Arm64 RyuJIT AdvSIMD
  .NET 7.0      : .NET 7.0.1 (7.0.122.56804), Arm64 RyuJIT AdvSIMD
  .NET 8.0      : .NET 8.0.0 (8.0.23.37103), Arm64 RyuJIT AdvSIMD
  NativeAOT 8.0 : .NET 8.0.0-rc.1.23371.3, Arm64 NativeAOT AdvSIMD

| Method | Job | Runtime | Length | Mean | Error | StdDev | Ratio |
|--------|-----|---------|-------:|-----:|------:|-------:|------:|
| Array | .NET 7.0 | .NET 7.0 | 10 | 39.13 ns | 0.506 ns | 0.395 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 10 | 13.30 ns | 0.018 ns | 0.016 ns | 0.34 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 16.24 ns | 0.028 ns | 0.023 ns | 0.41 |
| List | .NET 7.0 | .NET 7.0 | 10 | 26.45 ns | 0.037 ns | 0.035 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 10 | 17.99 ns | 0.031 ns | 0.029 ns | 0.68 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 10 | 22.34 ns | 0.047 ns | 0.044 ns | 0.84 |
| Array | .NET 7.0 | .NET 7.0 | 100 | 58.72 ns | 0.163 ns | 0.152 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 100 | 31.41 ns | 0.058 ns | 0.048 ns | 0.53 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 32.82 ns | 0.082 ns | 0.073 ns | 0.56 |
| List | .NET 7.0 | .NET 7.0 | 100 | 109.95 ns | 0.158 ns | 0.140 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 100 | 37.40 ns | 0.101 ns | 0.094 ns | 0.34 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 100 | 40.23 ns | 0.375 ns | 0.351 ns | 0.37 |
| Array | .NET 7.0 | .NET 7.0 | 1000 | 469.01 ns | 3.150 ns | 2.947 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 1000 | 266.25 ns | 0.453 ns | 0.424 ns | 0.57 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 1000 | 249.36 ns | 0.383 ns | 0.319 ns | 0.53 |
| List | .NET 7.0 | .NET 7.0 | 1000 | 856.08 ns | 3.760 ns | 3.140 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 1000 | 278.62 ns | 0.551 ns | 0.460 ns | 0.33 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 1000 | 259.52 ns | 0.809 ns | 0.756 ns | 0.30 |
| Array | .NET 7.0 | .NET 7.0 | 10000 | 4,111.95 ns | 68.902 ns | 64.451 ns | 1.00 |
| Array | .NET 8.0 | .NET 8.0 | 10000 | 2,420.28 ns | 13.809 ns | 12.917 ns | 0.59 |
| Array | NativeAOT 8.0 | NativeAOT 8.0 | 10000 | 2,284.52 ns | 7.150 ns | 6.338 ns | 0.56 |
| List | .NET 7.0 | .NET 7.0 | 10000 | 7,661.80 ns | 19.279 ns | 18.034 ns | 1.00 |
| List | .NET 8.0 | .NET 8.0 | 10000 | 2,581.48 ns | 15.425 ns | 14.429 ns | 0.34 |
| List | NativeAOT 8.0 | NativeAOT 8.0 | 10000 | 2,487.72 ns | 11.772 ns | 10.436 ns | 0.32 |

@neon-sunset neon-sunset deleted the simd-range-init branch July 23, 2023 16:07
@dotnet dotnet locked as resolved and limited conversation to collaborators Aug 22, 2023