
Switch to iSimdVector and Align WidenAsciiToUtf16 #99982

Merged (5 commits, Apr 8, 2024)

Conversation

DeepakRajendrakumaran (Contributor) commented Mar 19, 2024

This is an updated version of PR #89892. It does the following:

  1. Add 'AnyMatches' support for ISimdVector
  2. Use ISimdVector to clean up the 'WidenAsciiToUtf16' implementation
  3. Align memory stores

Perf Results

Ran the following tests (sizes: 16, 512, 1024, 5120, 10240) on EMR (base = main branch, diff = with change): https://github.com/dotnet/performance/blob/47d21ee9571164a8e3f8088d8709ca4061d96827/src/benchmarks/micro/libraries/System.Text.Encoding/Perf.Encoding.cs

On EMR: [benchmark results image]

On ICX (not much diff): [benchmark results image]

DeepakRajendrakumaran (Contributor, Author) commented Mar 20, 2024

@tannergooding @dotnet/avx512-contrib Can you please review this?

private static unsafe bool HasMatch<TVectorByte>(TVectorByte vector)
    where TVectorByte : unmanaged, ISimdVector<TVectorByte, byte>
{
    return !(vector & TVectorByte.Create((byte)0x80)).Equals(TVectorByte.Zero);
}
Member:

Why not (vector & TVectorByte.Create((byte)0x80)) != TVectorByte.Zero?

Contributor Author:

I basically ran into a weird issue where perf degraded significantly for specific cases on my ICX when I had '(vector & TVectorByte.Create((byte)0x80)) != TVectorByte.Zero', and the issue went away with what I have now. It was pretty consistent. I had some trouble narrowing down exactly why with VTune, though.

I decided to go with the performant version for now
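For reference, the two candidates are semantically identical (any byte with its high bit set makes the result true); they differ only in whether the comparison goes through operator != or Equals. A sketch of both, using the same constraints as HasMatch above and assuming the interface exposes the equality operators the review suggestion implies:

    // Variant suggested in review: operator != on the masked vector.
    private static bool HasMatchViaInequality<TVectorByte>(TVectorByte vector)
        where TVectorByte : unmanaged, ISimdVector<TVectorByte, byte>
    {
        return (vector & TVectorByte.Create((byte)0x80)) != TVectorByte.Zero;
    }

    // Variant kept in the PR: negated Equals on the masked vector.
    private static bool HasMatchViaEquals<TVectorByte>(TVectorByte vector)
        where TVectorByte : unmanaged, ISimdVector<TVectorByte, byte>
    {
        return !(vector & TVectorByte.Create((byte)0x80)).Equals(TVectorByte.Zero);
    }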

Member:

Was there a codegen difference between them? I'd expect them to generate the same code

Contributor Author:

I did a quick check on godbolt and they all look the same - https://godbolt.org/z/P87dPdTGa

Now I'm curious if it was just something off when I ran it locally

Contributor Author:

I did some more digging and something is off. Tried 3 versions with a handwritten benchmark: [results image]
Benchmark: [benchmark source image]

Match3 is significantly faster than Match1 and Match2

VTune for Match1 vs Match3: [VTune comparison image]

The inlining makes it hard to narrow down

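The benchmark screenshots are not recoverable in this extract, and the exact Match1/Match2/Match3 bodies are unknown, but a minimal BenchmarkDotNet harness along these lines (hypothetical, comparing only the two forms quoted in this thread) would reproduce the measurement:

    using System.Runtime.Intrinsics;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    public class HasMatchBench
    {
        private static readonly Vector512<byte> s_mask = Vector512.Create((byte)0x80);
        private Vector512<byte> _data = Vector512.Create((byte)0x41); // all-ASCII input

        [Benchmark(Baseline = true)]
        public bool ViaNotEquals() => (_data & s_mask) != Vector512<byte>.Zero;

        [Benchmark]
        public bool ViaEquals() => !(_data & s_mask).Equals(Vector512<byte>.Zero);
    }

    public class Program
    {
        public static void Main() => BenchmarkRunner.Run<HasMatchBench>();
    }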

Contributor Author:

Do you have a strong preference for any of these patterns?

I can look into why this is happening if it's important. For now, I'm just keeping the fast version.

Contributor Author:

@tannergooding I have created an issue detailing this as discussed: #100493

Please let me know if there is anything else needed for this PR

Comment on lines 2164 to 2196
if (!HasMatch<TVectorByte>(asciiVector))
{
    (TVectorUShort utf16LowVector, TVectorUShort utf16HighVector) = Widen<TVectorByte, TVectorUShort>(asciiVector);
    utf16LowVector.Store(pCurrentWriteAddress);
    utf16HighVector.Store(pCurrentWriteAddress + TVectorUShort.Count);
    pCurrentWriteAddress += (nuint)(TVectorUShort.Count * 2);
    if (((int)pCurrentWriteAddress & 1) == 0)
    {
        // Bump write buffer up to the next aligned boundary
        pCurrentWriteAddress = (ushort*)((nuint)pCurrentWriteAddress & ~(nuint)(TVectorUShort.Alignment - 1));
        nuint numBytesWritten = (nuint)pCurrentWriteAddress - (nuint)pUtf16Buffer;
        currentOffset += (nuint)numBytesWritten / 2;
    }
    else
    {
        // If input isn't char aligned, we won't be able to align it to a Vector
        currentOffset += (nuint)TVectorByte.Count;
    }
    while (currentOffset <= finalOffsetWhereCanRunLoop)
    {
        asciiVector = TVectorByte.Load(pAsciiBuffer + currentOffset);
        if (HasMatch<TVectorByte>(asciiVector))
        {
            break;
        }
        (utf16LowVector, utf16HighVector) = Widen<TVectorByte, TVectorUShort>(asciiVector);
        utf16LowVector.StoreAligned(pCurrentWriteAddress);
        utf16HighVector.StoreAligned(pCurrentWriteAddress + TVectorUShort.Count);
        currentOffset += (nuint)TVectorByte.Count;
        pCurrentWriteAddress += (nuint)(TVectorUShort.Count * 2);
    }
}
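For context on the alignment step in this diff: despite the "bump up" comment, the mask rounds the write pointer down to the previous Alignment boundary, which always lies inside the region the first unaligned store already covered, and currentOffset is recomputed to match; the first aligned iteration therefore re-widens and re-stores a small overlapping chunk with identical data. A worked example with hypothetical addresses, assuming TVectorUShort is Vector512<ushort> (Count = 32, Alignment = 64) and currentOffset starts at 0:

    // pUtf16Buffer = pCurrentWriteAddress = 0x1008  (char-aligned, not 64-byte aligned)
    // first two stores write 128 bytes  -> pCurrentWriteAddress = 0x1088
    // align down: 0x1088 & ~63          -> pCurrentWriteAddress = 0x1080
    // numBytesWritten = 0x1080 - 0x1008 = 120 bytes -> currentOffset += 60 chars
    // The next iteration reloads from source offset 60 and stores at 0x1080,
    // rewriting the last 8 bytes of the first store with the same widened data.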
Member:

The code here looks generally good, and I don't expect anything to be changed for this PR based on what I'm about to comment.

However, I would like to refer to how we set up everything for TensorPrimitives as it works very well and allows a lot of code sharing (noting it isn't using ISimdVector yet since it's out of band, but could easily do so in the future): https://source.dot.net/#System.Numerics.Tensors/System/Numerics/Tensors/netcore/Common/TensorPrimitives.IUnaryOperator.cs,fdab74764af40a1e

In general it tries to do the pre-checks up front and only ever execute one vector path (so it doesn't have to fall through from Vector512->Vector256->Vector128 as remainders exist). To achieve this, it has the Vectorized### helpers (which are all identical except for the size they operate on; this is what would eventually use ISimdVector) and then a shared VectorizedSmall, which is simply a jump table designed to handle any data that is less than a full vector using a single branch.

The core logic for the vectorized algorithm (https://source.dot.net/#System.Numerics.Tensors/System/Numerics/Tensors/netcore/Common/TensorPrimitives.IUnaryOperator.cs,ef9adce4e9561b04) then basically has a path to handle the main loop (which is currently unrolled by a factor of 8) and otherwise hits a jump table to handle remaining blocks (so they can likewise be handled with a single branch).

To help optimize, it preloads the beginning and ending vectors. In the worst case this will result in double processing of some inputs for very small sizes, but it's ultimately only 2 main operations, which is fine.

The main loop then attempts to align and has an optimized path for extremely large inputs for non-temporal data if alignment could be achieved. Smaller inputs just do regular unaligned stores since the actual address will have been aligned if that was feasible.

This general approach is done because it allows all paths, but particularly the smallest inputs, to minimize the total number of branches taken (no more than 2 branches for non-vectorized data and no more than 3 to hit the main loop for vectorized code). It also allows us to separate the "algorithm logic" from the "vectorization logic" and share that vectorization logic between multiple vectorized algorithms.

This general setup has worked so well and provided very stable perf numbers for all sizes, such that we opened #93217 as a means of investigating if we could make it more general purpose and public. Long term, it'd probably be desirable to move algorithms like this WidenAsciiToUtf16 to follow the same approach so that we get the best perf, with the least overhead.
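As a rough illustration of the shape being described (a simplified sketch, not the actual TensorPrimitives code; Vectorized### and VectorizedSmall are the names used in the comment above, while IUnaryOp, InvokeSpan, and Scalar are placeholders):

    using System;
    using System.Runtime.Intrinsics;

    // Hypothetical operator interface standing in for the internal one.
    interface IUnaryOp<T> { static abstract T Invoke(T x); }

    static class UnaryDispatch
    {
        public static void InvokeSpan<T, TOp>(ReadOnlySpan<T> x, Span<T> dst)
            where TOp : IUnaryOp<T>
        {
            // Pre-checks up front: exactly one path runs, so there is no
            // Vector512 -> Vector256 -> Vector128 fallthrough for remainders.
            if (Vector512.IsHardwareAccelerated && x.Length >= Vector512<T>.Count)
                Vectorized512<T, TOp>(x, dst);
            else if (Vector256.IsHardwareAccelerated && x.Length >= Vector256<T>.Count)
                Vectorized256<T, TOp>(x, dst);
            else if (Vector128.IsHardwareAccelerated && x.Length >= Vector128<T>.Count)
                Vectorized128<T, TOp>(x, dst);
            else
                VectorizedSmall<T, TOp>(x, dst); // jump table over sub-vector sizes
        }

        // In the real code these are identical except for vector width (and could
        // eventually collapse into one ISimdVector-generic method); the unrolled
        // main loop and remainder jump table are elided to a scalar stand-in here.
        static void Vectorized512<T, TOp>(ReadOnlySpan<T> x, Span<T> dst) where TOp : IUnaryOp<T> => Scalar<T, TOp>(x, dst);
        static void Vectorized256<T, TOp>(ReadOnlySpan<T> x, Span<T> dst) where TOp : IUnaryOp<T> => Scalar<T, TOp>(x, dst);
        static void Vectorized128<T, TOp>(ReadOnlySpan<T> x, Span<T> dst) where TOp : IUnaryOp<T> => Scalar<T, TOp>(x, dst);
        static void VectorizedSmall<T, TOp>(ReadOnlySpan<T> x, Span<T> dst) where TOp : IUnaryOp<T> => Scalar<T, TOp>(x, dst);

        static void Scalar<T, TOp>(ReadOnlySpan<T> x, Span<T> dst) where TOp : IUnaryOp<T>
        {
            for (int i = 0; i < x.Length; i++) dst[i] = TOp.Invoke(x[i]);
        }
    }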

Contributor Author:

This is really useful, and something I can add on in the future. Thanks for the detailed comment.

@@ -692,6 +692,11 @@ private string ToString([StringSyntax(StringSyntaxAttribute.NumericFormat)] stri
// New Surface Area
//

static bool ISimdVector<Vector128<T>, T>.AnyMatches(Vector128<T> vector)
Member:

We did review/approve this (#98055 (comment)) and settled on bool Any(Vector128<T> vector, T value) and bool AnyWhereAllBitsSet(Vector<T> vector)

It would be nice to fix this to follow that. The PR otherwise looks good and should be mergeable.

Contributor Author:

So, the way I understand it, we do not need AnyMatches() anymore.

And Any and AnyWhereAllBitsSet would look something like the following:


static bool ISimdVector<Vector512<T>, T>.AnyWhereAllBitsSet(Vector512<T> vector)
{
    return (vector.EqualsAny(Vector512<T>.AllBitsSet));
}

static bool ISimdVector<Vector512<T>, T>.Any(Vector512<T> vector, T value)
{
    return (vector.EqualsAny(Vector512.Create((T)value)));
}
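With that shape in place, the ASCII check from earlier in this PR could presumably be phrased in terms of the approved surface. A sketch against the proposed API, not merged code:

    private static bool HasMatch<TVectorByte>(TVectorByte vector)
        where TVectorByte : unmanaged, ISimdVector<TVectorByte, byte>
    {
        // After masking, every element is either 0x00 or 0x80, so
        // "any element equals 0x80" means "some byte is non-ASCII".
        return TVectorByte.Any(vector & TVectorByte.Create((byte)0x80), (byte)0x80);
    }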

Member:

Yes, essentially, with the ability for the JIT to recognize these as intrinsic and optimize them more in appropriate scenarios (but that's not necessary for this PR)

Contributor Author:

Done - took out AnyMatches() and added AnyWhereAllBitsSet() and Any()

Contributor Author:

@tannergooding Does this look correct? Anything else you'd like me to fix?

@@ -692,6 +692,16 @@ private string ToString([StringSyntax(StringSyntaxAttribute.NumericFormat)] stri
// New Surface Area
//

static bool ISimdVector<Vector128<T>, T>.AnyWhereAllBitsSet(Vector128<T> vector)
{
    return (Vector128.EqualsAny(vector, Vector128<T>.AllBitsSet));
}
Member:

nit: unnecessary parens here and in Any

If you could fix that in a follow-up PR, that'd be great (going to merge this).
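That is, the follow-up would presumably reduce the body to:

    static bool ISimdVector<Vector128<T>, T>.AnyWhereAllBitsSet(Vector128<T> vector)
    {
        return Vector128.EqualsAny(vector, Vector128<T>.AllBitsSet);
    }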

Contributor Author:

Sounds good! Will put it up later today.

DrewScoggins (Member) commented Apr 11, 2024

Regressions
Linux Ampere arm64: #100922
Linux x64: dotnet/perf-autofiling-issues#32721
Windows x64: dotnet/perf-autofiling-issues#32733

@tannergooding @DeepakRajendrakumaran

matouskozak (Member) commented Apr 16, 2024

lewing (Member) commented Apr 20, 2024

Mono WASM AOT regression: dotnet/perf-autofiling-issues#32601

matouskozak pushed a commit to matouskozak/runtime that referenced this pull request Apr 30, 2024
* Add AnyMatches() to iSimdVector interface

* Switch to iSimdVector and Align WidenAsciiToUtf16.

* Fixing perf

* Addressing Review Comments.

* Mirroring API change : dotnet#98055 (comment)
Labels: area-System.Runtime.Intrinsics, community-contribution