Improve performance of Memory<T>.Span property getter #20386

Merged

@GrabYourPitchforks
Member

commented Oct 12, 2018

Perf results for accessing the Memory<T>.Span property getter given different backing objects:

| Method | Toolchain | Mean | Error | StdDev | Scaled |
|---|---|---:|---:|---:|---:|
| CharFromString | baseline (27008) | 3,360.8 ns | 31.849 ns | 28.233 ns | 1.00 |
| CharFromString | w/ changes | 2,946.9 ns | 24.783 ns | 21.969 ns | 0.88 |
| CharFromArrayOfChar | baseline (27008) | 9,092.7 ns | 156.120 ns | 146.035 ns | 1.00 |
| CharFromArrayOfChar | w/ changes | 3,551.1 ns | 40.715 ns | 36.093 ns | 0.39 |
| CharFromMemoryManagerOfChar | baseline (27008) | 9,088.1 ns | 83.590 ns | 78.190 ns | 1.00 |
| CharFromMemoryManagerOfChar | w/ changes | 6,460.6 ns | 36.038 ns | 28.136 ns | 0.71 |
| ByteFromArrayOfByte | baseline (27008) | 8,479.8 ns | 219.163 ns | 277.172 ns | 1.00 |
| ByteFromArrayOfByte | w/ changes | 3,233.9 ns | 30.808 ns | 27.310 ns | 0.38 |
| ByteFromArrayOfSByte | baseline (27008) | 42,603.5 ns | 685.866 ns | 641.559 ns | 1.00 |
| ByteFromArrayOfSByte | w/ changes | 3,381.6 ns | 71.574 ns | 155.597 ns | 0.08 |
| ByteFromMemoryManagerOfByte | baseline (27008) | 8,816.3 ns | 175.853 ns | 164.493 ns | 1.00 |
| ByteFromMemoryManagerOfByte | w/ changes | 6,569.5 ns | 35.088 ns | 31.105 ns | 0.75 |
| GetSpanFromEmptyMemory | baseline (27008) | 1,207.4 ns | 22.214 ns | 19.693 ns | 1.00 |
| GetSpanFromEmptyMemory | w/ changes | 887.2 ns | 4.268 ns | 3.784 ns | 0.74 |
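
The benchmark source is not shown in this thread. As a rough illustration of what the rows above exercise, the sketch below builds a Memory<T> over each kind of backing object named in the table. `ArrayBackedManager<T>` and `BackingStoreSketch` are hypothetical names made up for this example, and reading ByteFromArrayOfSByte as "an sbyte[] reinterpreted as byte[]" is an assumption based on the benchmark name alone.

```csharp
using System;
using System.Buffers;
using System.Runtime.InteropServices;

// Hypothetical MemoryManager<T> that simply wraps an array (illustration only).
sealed class ArrayBackedManager<T> : MemoryManager<T>
{
    private readonly T[] _array;
    public ArrayBackedManager(T[] array) => _array = array;

    public override Span<T> GetSpan() => _array;

    // Pinning is not needed for this sketch.
    public override MemoryHandle Pin(int elementIndex = 0) => throw new NotSupportedException();
    public override void Unpin() { }
    protected override void Dispose(bool disposing) { }
}

static class BackingStoreSketch
{
    // String-backed: a writable Memory<char> over a string has to go through MemoryMarshal.
    public static Memory<char> FromString(string s) => MemoryMarshal.AsMemory(s.AsMemory());

    // Array-backed: implicit conversion from T[] to Memory<T>.
    public static Memory<char> FromArray(char[] a) => a;

    // The CLR allows an sbyte[] to be viewed as a byte[]; the object's real type stays sbyte[].
    public static Memory<byte> FromSByteArray(sbyte[] a) => (byte[])(object)a;

    // MemoryManager-backed.
    public static Memory<T> FromManager<T>(T[] a) => new ArrayBackedManager<T>(a).Memory;

    static void Main()
    {
        Console.WriteLine(FromString("abc").Span.Length);               // 3
        Console.WriteLine(FromArray(new[] { 'x', 'y' }).Span.Length);   // 2
        Console.WriteLine(FromSByteArray(new sbyte[] { -1 }).Span[0]);  // 255
        Console.WriteLine(FromManager(new byte[4]).Span.Length);        // 4
        Console.WriteLine(default(Memory<byte>).Span.IsEmpty);          // True
    }
}
```

In each case the Span getter has to work out what kind of object sits in the backing field before it can produce a span, which is the cost the table measures.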

Various optimizations include:

  • We can rely on the Memory<T> ctor and factory methods to ensure that an object of an unexpected type never makes it into the _object backing field, even in the face of a torn struct. This allows us to use unsafe code, bypassing the runtime type checks, but still requires that we perform bounds checks as appropriate.

  • Single exit from the Span property getter allows the JIT to optimize stack usage and minimize unnecessary data copying.

  • Since we're part of coreclr, we can use deep knowledge of runtime object layout and method table layout to further skip some type checks. This is particularly useful in the case where an array is the backing store.

  • The Length property getter is once again just a simple field access with no bitmask logic.

  • The Span property getter's code gen size shrinks by around 60%.
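
To make the first two bullets concrete, here is a simplified, safe-code sketch of a single-exit getter. `MiniMemory<T>` is a hypothetical stand-in written for this example. It handles only arrays and MemoryManager<T>, and it uses ordinary casts where the actual change uses unsafe code to skip the type checks, so it shows the shape of the getter rather than the PR's implementation.

```csharp
using System;
using System.Buffers;

// Hypothetical, simplified stand-in for Memory<T> (illustration only).
readonly struct MiniMemory<T>
{
    private readonly object _object; // only ever a T[] or a MemoryManager<T>; the ctors guarantee it
    private readonly int _index;
    private readonly int _length;

    public MiniMemory(T[] array, int index, int length)
    {
        if (array == null) throw new ArgumentNullException(nameof(array));
        if ((uint)index > (uint)array.Length || (uint)length > (uint)(array.Length - index))
            throw new ArgumentOutOfRangeException();
        _object = array;
        _index = index;
        _length = length;
    }

    public MiniMemory(MemoryManager<T> manager)
    {
        _object = manager ?? throw new ArgumentNullException(nameof(manager));
        _index = 0;
        _length = manager.GetSpan().Length;
    }

    public Span<T> Span
    {
        get
        {
            Span<T> result = default;   // the one value returned at the bottom
            object obj = _object;       // read the field once
            if (obj != null)
            {
                Span<T> whole = obj is T[] array
                    ? array.AsSpan()
                    : ((MemoryManager<T>)obj).GetSpan();

                // Slice re-validates index and length, so a torn struct
                // cannot produce an out-of-range span.
                result = whole.Slice(_index, _length);
            }
            return result;              // single exit
        }
    }
}

static class MiniMemoryDemo
{
    static void Main()
    {
        var m = new MiniMemory<int>(new[] { 1, 2, 3, 4 }, 1, 2);
        Console.WriteLine(m.Span.Length);                          // 2
        Console.WriteLine(m.Span[0]);                              // 2
        Console.WriteLine(default(MiniMemory<int>).Span.IsEmpty);  // True
    }
}
```

Because the constructors are the only way to populate `_object`, the getter never needs a fallback branch for an unexpected type; only the bounds check has to stay.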

Unit tests will not pass until the corresponding corefx change comes online since the unit tests perform private reflection over the Memory<T> backing fields.

I realize that this may be contentious, especially considering the fact that this proposes probing into the internals of how objects are represented within the runtime. But coreclr code is uniquely positioned to take advantage of these implementation details since the managed code can evolve in sync with any VM changes.

I'm also open to any recommendations people might have for real-world benchmarks so that we can test whether these changes are actually useful in practice.

GrabYourPitchforks force-pushed the GrabYourPitchforks:memoryspan_perf branch from 6cf9325 to 6d04254 Oct 23, 2018

@GrabYourPitchforks

Member Author

commented Oct 23, 2018

The GetHashCode changes in the latest iteration aren't needed since we should never be able to construct a Memory<T> with a null backing object and a non-zero index / length. I'll revert them.

@GrabYourPitchforks

Member Author

commented Oct 23, 2018

I have a pending PR in corefx (dotnet/corefx#32994) to react to the unit test breaks that will occur when this PR goes through.

@stephentoub
Member

left a comment

I'll defer to @jkotas on whether we're comfortable with this level of internals reliance, but otherwise, LGTM.

@jkotas

Member

commented Oct 25, 2018

I am ok with depending on internals like this as long as it is within CoreLib.

I am a bit worried about subtle GC holes that GetObjectMethodTablePointer may introduce in certain situations.


return Unsafe.Add(ref Unsafe.As<byte, IntPtr>(ref JitHelpers.GetPinningHelper(obj).m_data), -1);

// Ideally this method would be replaced by the VM with:

@jkotas

jkotas Oct 25, 2018

Member

I do not think we would ever want this. This would mean hacking the JIT to allow dereferencing objectrefs as pointers. It is invalid IL and it should stay as invalid IL.

I think what we would want here is to make the methodtable pointer a field and fetch that field here, similar to how it is done in CoreRT.

@GrabYourPitchforks

GrabYourPitchforks Oct 25, 2018

Author Member

As an experiment, I tried writing an [Intrinsic] method that was VM-replaced by the mentioned IL sequence. The release VM and JIT handled it just fine (and produced the desired codegen), but the checked JIT hit an assert and failed. I'm sure other parts of the JIT would get angry if I tried to force this through. :)

Really all this gets us in the end is that we eliminate a single instruction. It allows us to turn this (the codegen from this PR):

lea rax, [rbx + 8] ; rbx = obj ref, rax = ref obj._firstField
mov rax, qword ptr [rax - 8] ; rax = pMethodTable
; .. remainder of logic ..

Into this (note no lea instruction or SIB syntax):

mov rax, qword ptr [rbx] ; rbx = obj ref, rax = pMethodTable
; .. remainder of logic ..

A single instruction probably isn't worth the complexity. I was just obsessing over the code gen.

@jkotas

jkotas Nov 2, 2018

Member

> the checked JIT hit an assert and failed. I'm sure other parts of the JIT would get angry if I tried to force this through. :)

Right, it is invalid IL and it should stay invalid IL. The comment that says we would want to ideally replace it by ldarg.0 + ldind.i should be deleted.

@CarolEidt

CarolEidt Nov 2, 2018

Member

That said, unless someone's already done the analysis and determined that it would be difficult for some fundamental reason for the JIT to generate this directly, you could file an issue so that it could be investigated.

@jkotas

jkotas Nov 2, 2018

Member

The right fix for this should be on the VM side, not in the JIT. To make this work, the VM should feed the following IL sequence to the JIT: ldarg.0 + ldfld m_pMethodTable.

@CarolEidt

CarolEidt Nov 2, 2018

Member

@jkotas - But would it be incorrect for the JIT to fold this? I know that we can get into trouble if we incorrectly optimize address expressions to produce interim byref results that don't point into an object, but it would seem that folding in this case could never be invalid - perhaps I'm missing something?

@jkotas

jkotas Nov 2, 2018

Member

> But would it be incorrect for the JIT to fold this?

Yes, folding the two instructions in the JIT should be fine too.

@@ -194,6 +196,45 @@ public static bool IsReferenceOrContainsReferences<T>()
// See getILIntrinsicImplementation for how this happens.
throw new InvalidOperationException();
}

// Returns true iff the object has a component size;
// i.e., is variable length like string, array, Utf8String.

@jkotas

jkotas Oct 25, 2018

Member

Nit: We do not have Utf8String and it is unclear whether we will ever have it. It may be a bit premature to start mentioning it in comments.

@GrabYourPitchforks

GrabYourPitchforks Oct 25, 2018

Author Member

Good eye - this was an oversight on my part. This code was copied from the feature/utf8string branch and I forgot to clean up the comments. Will fix.

@jkotas

jkotas Nov 2, 2018

Member

Utf8String still mentioned

@danmosemsft

Member

commented Oct 25, 2018

Cc @tannergooding @eerhardt

Nice to see this as this has come up as a problem in ML.NET when they tried to use Span more.

@GrabYourPitchforks

Member Author

commented Oct 25, 2018

@jkotas You mentioned a potential GC hole in GetObjectMethodTablePointer. Can you elaborate? I believe this should be GC-safe, since getting the ref int to the first field will be an interior managed pointer back into the object, which is GC-tracked. If the object doesn't have any data fields, then the managed pointer should point just past the end of the object, which the GC treats as equivalent to an interior pointer into the object itself. Then backing up by sizeof(void*) should backtrack no earlier than the original object reference pointer, which is still GC-tracked.

One of my assumptions above may be incorrect. If so we can change the method to be more reliable.

@AndyAyersMS

Member

commented Oct 25, 2018

Since you are modelling the resulting object pointer as an IntPtr there is a potential window of vulnerability where the object might move between the time you subtract to create the object pointer and the time you read from the pointer. An actual object reference will get updated but your IntPtr version won't be.

You might get lucky if the jit does not generate fully interruptible GC, or if the jit folds the subtract into an address mode for the read so that this intermediate pointer never exists in a register (or, as in your example above, folds both the first field offset and the subtract to give a net offset of zero) -- but there's no guarantee these will happen.
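
The general distinction at stake here, between a managed byref that the GC tracks and a raw IntPtr that it does not, can be illustrated outside CoreLib with a small sketch. This is not the code under review: `ByrefSketch` and both method names are hypothetical, and the example steps back one array element rather than stepping back to an object's method table pointer.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class ByrefSketch
{
    // Byref arithmetic: both 'current' and 'previous' are managed pointers.
    // If the GC moves the array, it updates them along with the array reference.
    public static int ReadPreviousViaByref(int[] data, int i)
    {
        if (i < 1 || i >= data.Length) throw new ArgumentOutOfRangeException(nameof(i));
        ref int current = ref data[i];
        ref int previous = ref Unsafe.Subtract(ref current, 1);
        return previous;
    }

    // IntPtr arithmetic: the GC knows nothing about these addresses, so the array
    // has to be pinned for as long as they are in use. Without the pin, a relocation
    // between computing 'previous' and reading through it would read stale memory.
    public static int ReadPreviousViaIntPtr(int[] data, int i)
    {
        if (i < 1 || i >= data.Length) throw new ArgumentOutOfRangeException(nameof(i));
        GCHandle pin = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            IntPtr current = Marshal.UnsafeAddrOfPinnedArrayElement(data, i);
            IntPtr previous = IntPtr.Subtract(current, sizeof(int));
            return Marshal.ReadInt32(previous);
        }
        finally
        {
            pin.Free();
        }
    }

    static void Main()
    {
        int[] data = { 10, 20, 30 };
        Console.WriteLine(ReadPreviousViaByref(data, 2));   // 20
        Console.WriteLine(ReadPreviousViaIntPtr(data, 2));  // 20
    }
}
```

The second method only stays correct because of the pin, and pinning is the kind of extra cost this change is trying to avoid.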

@jkotas

Member

commented Oct 25, 2018

A tracked byref that points at offset 0 in the object does not show up anywhere else in the system today.

We had problems with this when we started using RyuJIT for CoreRT, because CoreRT has had a helper to fetch the object's methodtable since forever. We have patched all the places where it was blowing up in RyuJIT with asserts, but that does not mean that everything works correctly. In fact, we know that CoreRT has reliability problems that we were not able to track down - this could very well be one of them.

@GrabYourPitchforks

Member Author

commented Oct 25, 2018

@AndyAyersMS We should end up with a managed IntPtr& (which happens to point to offset 0 inside some object) rather than a regular IntPtr. I'm under the impression that the GC should track this as it does any other byref, but Jan's making me question my reality now. 🤔

@jkotas Do you think this is a potential reliability issue that might scuttle this PR? I can perform a benchmark wherein we first pin the object, but I was trying my absolute hardest to avoid anything that might stack-spill, including pinning.

@jkotas

Member

commented Oct 25, 2018

> I'm under the impression that the GC should track this as it does any other byref

You are getting lucky because we have the syncblock index in front of each object. As a side effect, it makes the system associate a byref pointing at offset zero with the current object (most of the time at least - modulo corner case bugs :-). If we did not have syncblocks in front of each object, a byref that points at offset zero would have to be associated with the end of the previous object per the ECMA spec.

> @jkotas Do you think this is a potential reliability issue that might scuttle this PR?

I am just saying that somebody may end up spending several weeks tracking down the reliability issue if there is one. If you do the due diligence to check for potential bugs here, I do not have a problem with this change. It may be a good idea to run some stress flavors on this.

@GrabYourPitchforks

Member Author

commented Nov 1, 2018

FYI I spoke with Maoni about this offline and she indicated that having an interior pointer to the "zero" offset of an object is valid as far as GC reporting / tracking is concerned. We might still need to stress test this since as Jan suggested it's probably not a well exercised condition.

@jkotas

Member

commented Nov 2, 2018

> Unit tests will not pass until the corresponding corefx change comes online since the unit tests perform

The outdated tests should be disabled in https://github.com/dotnet/coreclr/blob/master/tests/CoreFX/CoreFX.issues.json to make the CI green.

GrabYourPitchforks changed the title from "[WIP] Improve performance of Memory<T>.Span property getter" to "Improve performance of Memory<T>.Span property getter" Nov 2, 2018

GrabYourPitchforks force-pushed the GrabYourPitchforks:memoryspan_perf branch from eeec04e to 3066bfd Nov 2, 2018

@GrabYourPitchforks

Member Author

commented Nov 2, 2018

Rebased PR on top of latest master to pick up Jan's earlier changes re: suppressing failing unit tests.

@@ -407,18 +454,7 @@ public bool Equals(Memory<T> other)
[EditorBrowsable(EditorBrowsableState.Never)]
public override int GetHashCode()
{
- return _object != null ? CombineHashCodes(_object.GetHashCode(), _index.GetHashCode(), _length.GetHashCode()) : 0;
+ return (_object != null) ? ReadOnlyMemory<T>.CombineHashCodes(RuntimeHelpers.GetHashCode(_object), _index, _length) : 0;

@jkotas

jkotas Nov 2, 2018

Member

Why not just use HashCode.Combine ?

@GrabYourPitchforks

GrabYourPitchforks Nov 2, 2018

Author Member

HashCode.Combine is randomized because it pessimistically assumes its inputs are attacker-controlled. The Memory<T> object doesn't really qualify as attacker-controlled, so it can use a more performance-oriented routine.

@jkotas

jkotas Nov 2, 2018

Member

HashCode.Combine also mixes the bits well to get better hashcode distribution than trivial combine functions.

It depends on whether anybody will ever use Memory.GetHashCode/Equals for anything real. These methods do not seem to be very useful...

@GrabYourPitchforks

GrabYourPitchforks Nov 5, 2018

Author Member

That's a good point, especially since the default implementation is referential equality. We can make them use the slower HashCode helpers for now for simplicity, and it leaves the door open for us to make targeted perf improvements to these methods in the future if the need arises.
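
The trade-off discussed in this sub-thread can be sketched as follows. `SimpleCombine` is a hypothetical stand-in for the kind of trivial combine function mentioned above (rotate, add, xor); it is not claimed to be the exact CombineHashCodes helper from this PR. `HashCode.Combine` and `RuntimeHelpers.GetHashCode` are the real framework APIs.

```csharp
using System;
using System.Runtime.CompilerServices;

static class HashSketch
{
    // Hypothetical cheap combiner: rotate left by 5, add, xor.
    // Deterministic across processes, but mixes bits poorly.
    public static int SimpleCombine(int h1, int h2)
    {
        uint rol5 = ((uint)h1 << 5) | ((uint)h1 >> 27);
        return ((int)rol5 + h1) ^ h2;
    }

    // Cheap variant: identity hash of the backing object folded with index and length.
    public static int HashSimple(object obj, int index, int length) =>
        obj != null
            ? SimpleCombine(SimpleCombine(RuntimeHelpers.GetHashCode(obj), index), length)
            : 0;

    // HashCode.Combine variant: seeded per process and better distributed, at a higher cost.
    public static int HashWithHashCode(object obj, int index, int length) =>
        obj != null
            ? HashCode.Combine(RuntimeHelpers.GetHashCode(obj), index, length)
            : 0;

    static void Main()
    {
        var backing = new byte[16];
        Console.WriteLine(SimpleCombine(1, 0));              // 33
        Console.WriteLine(HashSimple(backing, 0, 16));       // value depends on the object's identity hash
        Console.WriteLine(HashWithHashCode(backing, 0, 16)); // value also depends on the per-process seed
    }
}
```

Both variants return the same value for the same inputs within a single process, which is all that GetHashCode requires; they differ in cost and in how well the result is distributed.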

@jkotas

jkotas approved these changes Nov 2, 2018

Member

left a comment

LGTM modulo a few nits. Thanks!

@GrabYourPitchforks

Member Author

commented Nov 2, 2018

@dotnet/jit-contrib There are some stress failures below. Are there any concerns with the failures noted here?

https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_windows_nt_corefx_jitstress1_prtest/7/
https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_windows_nt_gcstress0x3_prtest/25/

The stress failures don't appear to be related to the code in this PR. (One of the failures is a simple unit test failure for Memory<T> - we already know that test needs to be updated in corefx.)

@GrabYourPitchforks

Member Author

commented Nov 5, 2018

Ping to @dotnet/jit-contrib, wondering if the stress failures noted above are known failures. They don't seem related to this particular code path. (See comment at #20386 (comment).)

GrabYourPitchforks added some commits Sep 11, 2018

Minor cleanup in Mem/ROM<T>
- GetHashCode should always take all three fields into consideration without short-circuiting since Equals does the same
- Removed duplicate helper methods from Mem<T>, changing the callers to use the existing helper methods on ROM<T>

GrabYourPitchforks force-pushed the GrabYourPitchforks:memoryspan_perf branch from 3066bfd to d3a9cf0 Nov 6, 2018

@GrabYourPitchforks

Member Author

commented Nov 6, 2018

Rebased on top of latest master due to last week's Memory / ReadOnlyMemory changes causing conflicts. Latest iteration also addresses the remaining comment cleanup, further improves Span property getter performance (using the slice trick we committed last week), and changes GetHashCode to be simpler but subject to a higher performance hit.

@GrabYourPitchforks

Member Author

commented Nov 6, 2018

Heard from @tannergooding that the stress test is a known failure. The particular failing test case takes so long to run that it hits a timeout when running under GC stress mode.

GrabYourPitchforks merged commit ef93a72 into dotnet:master Nov 6, 2018

30 of 31 checks passed

Tizen armel Cross Checked Innerloop Build and Test: Build finished.
CentOS7.1 x64 Checked Innerloop Build and Test: Build finished.
CentOS7.1 x64 Debug Innerloop Build: Build finished.
Linux-musl x64 Debug Build: Build finished.
OSX10.12 x64 Checked Innerloop Build and Test: Build finished.
Ubuntu arm Cross Checked Innerloop Build and Test: Build finished.
Ubuntu arm Cross Checked crossgen_comparison Build and Test: Build finished.
Ubuntu arm Cross Checked no_tiered_compilation_innerloop Build and Test: Build finished.
Ubuntu arm Cross Release crossgen_comparison Build and Test: Build finished.
Ubuntu x64 Checked CoreFX Tests: Build finished.
Ubuntu x64 Checked Innerloop Build and Test: Build finished.
Ubuntu x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0): Build finished.
Ubuntu x64 Formatting: Build finished.
Ubuntu16.04 arm64 Cross Checked Innerloop Build and Test: Build finished.
Ubuntu16.04 arm64 Cross Checked no_tiered_compilation_innerloop Build and Test: Build finished.
WIP: Ready for review
Windows_NT arm Cross Debug Innerloop Build: Build finished.
Windows_NT arm64 Cross Debug Innerloop Build: Build finished.
Windows_NT x64 Checked CoreFX Tests: Build finished.
Windows_NT x64 Checked Innerloop Build and Test: Build finished.
Windows_NT x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0): Build finished.
Windows_NT x64 Formatting: Build finished.
Windows_NT x64 Release CoreFX Tests: Build finished.
Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness: Build finished.
Windows_NT x64 min_opt ryujit CoreCLR Perf Tests Correctness: Build finished.
Windows_NT x86 Checked Innerloop Build and Test: Build finished.
Windows_NT x86 Checked Innerloop Build and Test (Jit - TieredCompilation=0): Build finished.
Windows_NT x86 Release Innerloop Build and Test: Build finished.
Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness: Build finished.
Windows_NT x86 min_opt ryujit CoreCLR Perf Tests Correctness: Build finished.
license/cla: All CLA requirements met.

GrabYourPitchforks deleted the GrabYourPitchforks:memoryspan_perf branch Nov 10, 2018

@@ -486,6 +486,14 @@
{
"name": "System.Buffers.Text.Tests.FormatterTests.TestFormatterDecimal",
"reason": "https://github.com/dotnet/coreclr/pull/19775"
},
{
"name": "System.SpanTests.MemoryMarshalTests.CreateFromPinnedArrayIntSliceRemainsPinned",

@ahsonkhan

ahsonkhan Nov 16, 2018

Member

@GrabYourPitchforks, can we revert this change now?

@jkotas

jkotas Nov 16, 2018

Member

No - we are still using the old copy of the tests. We will get the CoreFX snapshot updated once the migration to Azure DevOps is complete.

A-And added a commit to A-And/coreclr that referenced this pull request Nov 20, 2018

Improve performance of Memory<T>.Span property getter (dotnet#20386)
- We can use our knowledge of object representation in the runtime to speed up type checks.
- We leave the ref T and the length deconstructed until the very end, optimizing register usage.
- The Length property getter is once again just a simple field accessor with no bitwise logic.

@buybackoff buybackoff referenced this pull request Jan 19, 2019

Closed

`Vec<T>` design #3

@ahsonkhan ahsonkhan referenced this pull request Mar 3, 2019

Closed

Memory spec question #2650
