Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize some Guid APIs (ctor, TryWriteBytes) #21336

Merged
merged 9 commits into from Dec 3, 2018

Conversation

Projects
None yet
5 participants
@EgorBo
Copy link
Contributor

commented Dec 3, 2018

  1. Guid.TryWriteBytes(Span<byte>):
[Benchmark]
[ArgumentsSource(nameof(TestData))]
public Span<byte> TryWriteBytes(Guid id)
{
    Span<byte> s = new byte[20];
    id.TryWriteBytes(s);
    return s;
}
Method Mean Error StdDev
TryWriteBytes_new 4.960 ns 0.0818 ns 0.0292 ns
TryWriteBytes_old 8.541 ns 0.0362 ns 0.0129 ns

  1. new Guid(byte[]) and new Guid(ReadOnlySpan<byte>):
[Benchmark]
[ArgumentsSource(nameof(TestData))]
public Guid GuidCtor(byte[] data)
{
    return new Guid(data);
}
Method Mean Error StdDev
GuidCtor_new 2.615 ns 0.0275 ns 0.0098 ns
GuidCtor_old 7.483 ns 0.0146 ns 0.0052 ns

  1. Guid.ToByteArray():
[Benchmark]
[ArgumentsSource(nameof(TestData))]
public byte[] ToByteArray(Guid id)
{
    return id.ToByteArray();
}
Method Mean Error StdDev
ToByteArray_new 3.569 ns 0.0192 ns 0.0068 ns
ToByteArray_old 6.265 ns 0.0776 ns 0.0202 ns

Environment:

BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17134.407 (1803/April2018Update/Redstone4)
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
Frequency=3609372 Hz, Resolution=277.0565 ns, Timer=TSC
.NET Core SDK=3.0.100-preview-009812
  [Host]     : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
  Job-CIYVXD : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT

IterationCount=6  WarmupCount=6  
@@ -841,8 +841,12 @@ public bool Equals(Guid g)
// Now compare each of the elements
return g._a == _a &&
Unsafe.Add(ref g._a, 1) == Unsafe.Add(ref _a, 1) &&
#if BIT64

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

Why not to do this for the first check as well?

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor

@jkotas in case if two guids are different (both random) - it becomes slower (see Equals_2 in the table - it does two ulong-checks).

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor
  • changed to be two ulong *
@@ -841,8 +841,12 @@ public bool Equals(Guid g)
// Now compare each of the elements
return g._a == _a &&
Unsafe.Add(ref g._a, 1) == Unsafe.Add(ref _a, 1) &&
#if BIT64
Unsafe.Add(ref Unsafe.As<int, ulong>(ref g._a), 1) == Unsafe.Add(ref Unsafe.As<int, ulong>(ref _a), 1);

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

This has potential portability problem: It assumes that accessing misaligned ulong is ok.

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor

I agree 🙁

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

So this should rather use Unsafe.ReadUnaligned

@jkotas

This comment has been minimized.

Copy link
Member

commented Dec 3, 2018

https://gist.github.com/EgorBo/75348d5f194cbe91342d1b77d73b6c04

I has AggressiveInlining on everything. The actual implementation does not have it. Are the results representative?

@EgorBo

This comment has been minimized.

Copy link
Contributor Author

commented Dec 3, 2018

@jkotas oh, indeed, let me re-run the benchmark without it. Initially I wanted to replace duplicated code to compare guids with Equals with [AggressiveInline] e.g. https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Guid.cs#L833-L845 (+ operator ==/!=)

@jkotas

This comment has been minimized.

Copy link
Member

commented Dec 3, 2018

@EgorBo

This comment has been minimized.

Copy link
Contributor Author

commented Dec 3, 2018

hm.. without AggressiveInlining results are a little bit different:
image
comparing Guids as two ulongs is the most efficient way now (tried several times on a clean env).

PS: probably it makes sense to mark this method with AggressiveInlining anyway.

EgorBo added some commits Dec 3, 2018

@EgorBo

This comment has been minimized.

Copy link
Contributor Author

commented Dec 3, 2018

Turns out WriteByteHelper can be optimized with SSE2 (it makes it twice faster in my cases).

destination[13] = _i;
destination[14] = _j;
destination[15] = _k;
if (Sse2.IsSupported)

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

I do not think it make sense to be special casing a simple copying like this for SSE2. Instead, this should be special cased for all little-endian platforms, like:

if (BitConverter.IsLittleEndian)
{
    MemoryMarshal.Write(destination, ref this);
}
else
{
...

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor

nice, it's even faster than the sse impl!

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

Folding WriteByteHelper into TryWriteBytes would make this even leaner for little endian platforms:

public byte[] ToByteArray()
{
    var g = new byte[16];
    if (BitConverter.IsLittleEndian)
    {
        MemoryMarshal.TryWrite<Guid>(g, ref this);
    }
    else
    {
        TryWriteBytes(g);
    }
    return g;
}

public bool TryWriteBytes(Span<byte> destination)
{
    if (BitConverter.IsLittleEndian)
    {
        return MemoryMarshal.TryWrite<Guid>(destination, ref this);    
    }
    else
    {
        ... the slower path to write the bytes one by one ...
        ... you make consider using BinaryPrimitives.WriteInt32LittleEndian and BinaryPrimitives.WriteInt16LittleEndian to implement it ...
    }
}

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

Maybe just call TryWriteBytes from ToByteArray directly. ToByteArray is slow allocating method. It is not going to be used on performance critical paths. It is better to have smaller code for it.

// Returns whether bytes are sucessfully written to given span.
public bool TryWriteBytes(Span<byte> destination)
{
if ((uint)destination.Length < 16)

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

Is this bounds check duplicated in MemoryMarshal.Write ?

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor

@jkotas I guess MemoryMarshal.Write has it's own bounds checks (according to output) and this uint-hack doesn't help.

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

So why not to use TryWrite for the little endian path like I have suggested above?

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor

@jkotas oh, I didn't notice it, thanks!

EgorBo added some commits Dec 3, 2018

}

// slower path for BigEndian
if ((uint)destination.Length >= 16)

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

Nit: Does writing it this way help anything? It is generally more readable to use early returns, like:

 if (destination.Length < 16)
    return false;

destination[0] = ...;
....
WriteByteHelper(destination);
return true;
if (BitConverter.IsLittleEndian)
{

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

The constructor that takes Span can be optimized the same way.

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor

but will it compile? C# requires stucts' constructors to init all fields by hands (however Decimal compiles fine hm...).

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor

ok it works

this = MemoryMarshal.Read<MyGuid>(b);
destination[12] = _h;
destination[13] = _i;
destination[14] = _j;
destination[15] = _k;

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

Why not to write this one first like you are doing in other places?

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor

I thought since it's surrounded with if ((uint)destination.Length >= 16) JIT would not insert bounds checks but turns out it inserts them anyway (I think there is a feature-request for it somewhere...)

This comment has been minimized.

Copy link
@jkotas

jkotas Dec 3, 2018

Member

Right, this optimization is very sensitive to exact pattern today.

This comment has been minimized.

Copy link
@grant-d

grant-d Dec 3, 2018

Contributor

Does it still write bounds checks if you specify 16u?

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 3, 2018

Author Contributor

@grant-d unfortunately yes it ignores (uint) > 16 (and 16u)

@EgorBo

This comment has been minimized.

Copy link
Contributor Author

commented Dec 3, 2018

@jkotas I've removed Equals fix (will do that in a separate PR with other places).
I'll rewrite the PR description in an hour and attach better benchmarks for what we've done now 🙂

@jkotas

This comment has been minimized.

Copy link
Member

commented Dec 3, 2018

Could you please also add a short comment like // Hoist bounds checks next to all of the out-of-order writes so that somebody looking at this will get a hint that it is done intentionally this way?

@EgorBo EgorBo changed the title Optimize Guid.Equals and remove some bounds checks Vectorize some Guid APIs (ctor, TryWriteBytes) Dec 3, 2018

@jkotas

jkotas approved these changes Dec 3, 2018

@jkotas jkotas merged commit 248449d into dotnet:master Dec 3, 2018

31 checks passed

CentOS7.1 x64 Checked Innerloop Build and Test Build finished.
Details
CentOS7.1 x64 Debug Innerloop Build Build finished.
Details
Linux-musl x64 Debug Build Build finished.
Details
OSX10.12 x64 Checked Innerloop Build and Test Build finished.
Details
Tizen armel Cross Checked Innerloop Build and Test Build finished.
Details
Ubuntu arm Cross Checked Innerloop Build and Test Build finished.
Details
Ubuntu arm Cross Checked crossgen_comparison Build and Test Build finished.
Details
Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test Build finished.
Details
Ubuntu arm Cross Release crossgen_comparison Build and Test Build finished.
Details
Ubuntu x64 Checked CoreFX Tests Build finished.
Details
Ubuntu x64 Checked Innerloop Build and Test Build finished.
Details
Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0) Build finished.
Details
Ubuntu x64 Formatting Build finished.
Details
Ubuntu16.04 arm64 Cross Checked Innerloop Build and Test Build finished.
Details
Ubuntu16.04 arm64 Cross Checked no_tiered_compilation_innerloop Build and Test Build finished.
Details
WIP Ready for review
Details
Windows_NT arm Cross Debug Innerloop Build Build finished.
Details
Windows_NT arm64 Cross Debug Innerloop Build Build finished.
Details
Windows_NT x64 Checked CoreFX Tests Build finished.
Details
Windows_NT x64 Checked Innerloop Build and Test Build finished.
Details
Windows_NT x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0) Build finished.
Details
Windows_NT x64 Formatting Build finished.
Details
Windows_NT x64 Release CoreFX Tests Build finished.
Details
Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details
Windows_NT x64 min_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details
Windows_NT x86 Checked Innerloop Build and Test Build finished.
Details
Windows_NT x86 Checked Innerloop Build and Test (Jit - TieredCompilation=0) Build finished.
Details
Windows_NT x86 Release Innerloop Build and Test Build finished.
Details
Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details
Windows_NT x86 min_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details
license/cla All CLA requirements met.
Details

Anipik pushed a commit to Anipik/corefx that referenced this pull request Dec 3, 2018

Vectorize some Guid APIs (ctor, TryWriteBytes) (dotnet/coreclr#21336)
* Optimize some Guid APIs

* get rid of WriteByteHelper

* use TryWrite instead of Write

* Optimize ctor `Guid(ReadOnlySpan<byte> b)` and remove `Equals` optimization (move to separate PR).

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

Anipik pushed a commit to Anipik/corert that referenced this pull request Dec 3, 2018

Vectorize some Guid APIs (ctor, TryWriteBytes) (dotnet/coreclr#21336)
* Optimize some Guid APIs

* get rid of WriteByteHelper

* use TryWrite instead of Write

* Optimize ctor `Guid(ReadOnlySpan<byte> b)` and remove `Equals` optimization (move to separate PR).

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

Anipik added a commit to dotnet/corert that referenced this pull request Dec 4, 2018

Vectorize some Guid APIs (ctor, TryWriteBytes) (dotnet/coreclr#21336)
* Optimize some Guid APIs

* get rid of WriteByteHelper

* use TryWrite instead of Write

* Optimize ctor `Guid(ReadOnlySpan<byte> b)` and remove `Equals` optimization (move to separate PR).

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
@@ -841,8 +841,12 @@ public bool Equals(Guid g)
// Now compare each of the elements
return g._a == _a &&

This comment has been minimized.

Copy link
@kellypleahy

kellypleahy Dec 4, 2018

This may be a dumb question that I could have just looked up, but I wanted to ask here so I was sure... What part of the GUID structure is the LSB here? Is it different based on big-endian vs. little-endian? The reason I ask is that if someone is comparing lots of sequentially assigned GUIDs, it's probably faster to compare LSB first, rather than MSB, right? Seems like that would be faster on average than the other short-circuit eval and not slower in the non-sequential case.

This comment has been minimized.

Copy link
@EgorBo

EgorBo Dec 4, 2018

Author Contributor

@kellypleahy it depends on what you mean by "sequentially assigned GUIDs". For instance MS SQL Server increment first int field something like this:
00000000-0000-0000-0000-000000000000
00010000-0000-0000-0000-000000000000

This comment has been minimized.

Copy link
@kellypleahy

kellypleahy Dec 4, 2018

Ah, thanks. It appears that UuidCreateSequential at least seems to update _a when generating sequential guids (and SQL Server just reverses the bytes of _a to big-endian when storing so as to ensure sequential order of the byte array) so in any case, comparing _a first is best by my argument above. Thanks for indulging me.

stephentoub added a commit to dotnet/corefx that referenced this pull request Dec 4, 2018

Vectorize some Guid APIs (ctor, TryWriteBytes) (dotnet/coreclr#21336)
* Optimize some Guid APIs

* get rid of WriteByteHelper

* use TryWrite instead of Write

* Optimize ctor `Guid(ReadOnlySpan<byte> b)` and remove `Equals` optimization (move to separate PR).

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

jlennox added a commit to jlennox/corefx that referenced this pull request Dec 16, 2018

Vectorize some Guid APIs (ctor, TryWriteBytes) (dotnet/coreclr#21336)
* Optimize some Guid APIs

* get rid of WriteByteHelper

* use TryWrite instead of Write

* Optimize ctor `Guid(ReadOnlySpan<byte> b)` and remove `Equals` optimization (move to separate PR).

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.