Vectorize Enumerable.Range initialization, take 2 #87992
Conversation
Tagging subscribers to this area: @dotnet/area-system-linq

Issue Details

Previous attempt: #80633

This one fixes the regression on x86_64 by avoiding an SDIV and having a smaller method body.

Diffs:
        goto Scalar;
    }

    int stride = Vector256<int>.Count;
Any reason to use V256 instead of V128 to support both x86 and arm?
Please check the Aarch64 diff in the description. The loop is written in such a way that the Vector256 codegen looks exactly like Vector128x2, because 2x unrolling seems to be advantageous in the case of NEON, and LLVM does the same for Rust's (0..n).collect() (two `add`s + an `stp`).

This is a bit unconventional compared to all other SIMD code in CoreLib, but I wanted to keep the size of the code to a minimum because this is System.Linq.
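To make the shape of that loop concrete, here is a scalar Rust sketch of the strided fill being discussed. The `fill` helper and `LANES` constant are hypothetical names for illustration; `LANES` stands in for `Vector256<int>.Count` (8 lanes of i32), and the per-lane `+= LANES` loop corresponds to a single vector add in the real code:

```rust
// Hypothetical scalar model of the vectorized Enumerable.Range fill:
// keep a running block [v, v+1, ..., v+LANES-1], store it, then add
// LANES to every lane (one SIMD add in the actual implementation),
// and finish the tail scalar-wise.
const LANES: usize = 8; // assumed width: Vector256<int> holds 8 i32s

fn fill(dest: &mut [i32], mut value: i32) {
    // Initial block: value, value+1, ..., value+LANES-1.
    let mut block: [i32; LANES] = std::array::from_fn(|i| value + i as i32);

    let mut chunks = dest.chunks_exact_mut(LANES);
    for chunk in &mut chunks {
        chunk.copy_from_slice(&block);
        for lane in &mut block {
            *lane += LANES as i32; // the "increment" vector add
        }
        value += LANES as i32;
    }

    // Scalar tail for the remaining < LANES elements.
    for slot in chunks.into_remainder() {
        *slot = value;
        value += 1;
    }
}
```

This is only a model of the access pattern, not the PR's actual code; the real loop stores two 128-bit halves (or one 256-bit vector) per iteration.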
Ok what's the codegen for e.g. SSE2 (NativeAOT's default target)?
Also, I don't see any guards for SIMD at all - arm32, linux-x86, mono-interp, HWINTRINSIC=0
> Ok what's the codegen for e.g. SSE2 (NativeAOT's default target)?

Let me check. Also, is it really as low as SSE2? That would be unfortunate.
Do we use this approach anywhere else in the BCL? Last time we discussed this, V256 via V128x2 was mainly used for testing only. Does mono handle it well?
How do you want the implementation to look? Full span.IndexOf style with V256 -> V128 -> Scalar blocks?

> Does mono handle it well?

No idea; do you know if mono handles it well?
> How do you want the implementation to look?

I am fine with the codegen you posted; I am just not aware if this approach is recommended or used anywhere else. cc @tannergooding
> How do you want the implementation to look?

Has something changed since the previous PR? My feedback there was that without motivating scenarios, this wasn't worth vectorizing at all. Can you point to places where Enumerable.Range(...).ToArray/ToList is being used in production code on perf-sensitive code paths?
Thank you for reviewing the change. Here are the concerns listed in the original PR stating that the implementation was not worth it:
- Negative impact on code size
- Negative impact on maintainability
- Negative impact on the un-vectorized path

I tried to address the first two by keeping the implementation as simple as possible (+35 lines) and evaluating the codegen size diff, which is +60B on x86_64 and +100B on Aarch64. The third one is addressed by getting rid of the length % Vector<int>.Count computation, which was causing regressions on x86_64.
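As a side note on why an explicit remainder can be avoided: when the lane count is a power of two (which vector lane counts always are), the vector-loop bound can be computed with a mask instead of a division/modulo. A minimal sketch, with a hypothetical `vector_loop_bound` helper that is not the PR's actual code (the PR uses bounds comparisons rather than this mask):

```rust
// Hypothetical helper: largest multiple of `lanes` that is <= len,
// computed by masking off the low bits instead of `len % lanes`
// (a remainder can lower to an SDIV-based sequence on some targets).
// Only valid when `lanes` is a power of two.
fn vector_loop_bound(len: usize, lanes: usize) -> usize {
    debug_assert!(lanes.is_power_of_two());
    len & !(lanes - 1)
}
```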
It is true that the pattern is most often used in tests, but I don't agree with the assertion that it is not a useful change (at some point you do start to care about test execution time, so it is nice to have a 2-4x speed-up for a moderately used path). There are also references to non-test code that could benefit:
- https://github.com/fabsenet/adrilight/blob/main/adrilight/ViewModel/SettingsViewModel.cs#L55-L56
- https://github.com/SciSharp/Pandas.NET/blob/master/src/Pandas.NET/Methods/Pandas.read_csv.cs#L43
- https://github.com/mdabros/SharpLearning/blob/master/src/SharpLearning.GradientBoost/Learners/ClassificationGradientBoostLearner.cs#L241

Matches: https://grep.app/search?q=Enumerable%5C.Range%5C%28%5B%5E%29%5D%2B%5C%29%5C.%28ToArray%7CToList%29%5C%28%5C%29&regexp=true&case=true&filter[lang][0]=C%23 (22 pages of results)

There is also precedent for having SIMD code in System.Linq: https://github.com/dotnet/runtime/blob/main/src/libraries/System.Linq/src/System/Linq/Sum.cs#L75
    return list;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void Fill(Span<int> destination, int value)
I don't think this should be aggressively inlined.
{
    // The improvements in the compiler now allow us to write the loop as Vector256.
    // Manually unrolling to Vector128x2 for SSE2/NEON is no longer profitable.
    for (int i = 0; i < destination.Length; i++, value++)
Can we just use Vector<T>? e.g.

static void Fill(Span<int> destination, int value)
{
    Debug.Assert(Vector<int>.Count * sizeof(int) * 8 <= 512);

    ref int c = ref MemoryMarshal.GetReference(destination);
    ref int end = ref Unsafe.Add(ref c, destination.Length);

    if (Vector.IsHardwareAccelerated && destination.Length >= Vector<int>.Count)
    {
        var init = new Vector<int>((ReadOnlySpan<int>)new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 });
        var increment = new Vector<int>(Vector<int>.Count);
        var current = new Vector<int>(value) + init;

        ref int oneVectorFromEnd = ref Unsafe.Subtract(ref end, Vector<int>.Count);
        do
        {
            current.StoreUnsafe(ref c);
            current += increment;
            c = ref Unsafe.Add(ref c, Vector<int>.Count);
        }
        while (!Unsafe.IsAddressLessThan(ref oneVectorFromEnd, ref c));

        value = current[0];
    }

    while (Unsafe.IsAddressLessThan(ref c, ref end))
    {
        c = value++;
        c = ref Unsafe.Add(ref c, 1);
    }
}
It is definitely an option, but I initially opted out of it to have better performance on ARM64. As of now, there are pros and cons to the Vector<T> variant.

Pros:
- Takes advantage of AVX512
- Does not rely on the JIT/ILC being able to perfectly optimize V256 as V128x2 for AdvSimd/SSE2
- Consistent with how other vectorized code is written
- Better performance on Mono? (as @EgorBo mentioned, Mono regressions are an issue that needs to be tracked)

Cons:
- Does not help the 0..15 element case, i.e. Enumerable.Range(0, 10).ToArray(), which is fairly common, on AVX512
- Has reduced gains on wide ARM64 cores (the M series can dispatch a V128x2 loop iteration in a single cycle; ARM Cortex-X3/X4 should be able to do the same). This also applies to the default NativeAOT target, which would use 128x2 with SSE2. I expect x86_64 behavior to be very similar.

Do you think Vector<T> is better for the time being?
> Takes advantage of AVX512

Not by default. By default, Vector<T> will be either 128-bits or 256-bits.

> Do you think Vector<T> is better for the time being?

Yes
Force-pushed from 342ad85 to 6165aa3
Force-pushed from 6165aa3 to 4891e8a
@stephentoub I have addressed the feedback but kept

They can be with dynamic pgo. Please remove the aggressive inlining.

Sorry for the confusion, I missed the attribute when force-pushing. The change is already without it, as you requested.

Thanks.
@stephentoub thank you for merging this and apologies for not addressing the feedback previously!

x86_64

Arm64
Previous attempt: #80633

This one fixes the regression on x86_64 by avoiding an SDIV and having a smaller method body.

Diffs: