Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve performance of Memory<T>.Span property getter #20386

Merged

Conversation

GrabYourPitchforks
Copy link
Member

Perf results for accessing the Memory<T>.Span property getter given different backing objects:

Method Toolchain Mean Error StdDev Scaled
CharFromString baseline (27008) 3,360.8 ns 31.849 ns 28.233 ns 1.00
CharFromString w/ changes 2,946.9 ns 24.783 ns 21.969 ns 0.88
CharFromArrayOfChar baseline (27008) 9,092.7 ns 156.120 ns 146.035 ns 1.00
CharFromArrayOfChar w/ changes 3,551.1 ns 40.715 ns 36.093 ns 0.39
CharFromMemoryManagerOfChar baseline (27008) 9,088.1 ns 83.590 ns 78.190 ns 1.00
CharFromMemoryManagerOfChar w/ changes 6,460.6 ns 36.038 ns 28.136 ns 0.71
ByteFromArrayOfByte baseline (27008) 8,479.8 ns 219.163 ns 277.172 ns 1.00
ByteFromArrayOfByte w/ changes 3,233.9 ns 30.808 ns 27.310 ns 0.38
ByteFromArrayOfSByte baseline (27008) 42,603.5 ns 685.866 ns 641.559 ns 1.00
ByteFromArrayOfSByte w/ changes 3,381.6 ns 71.574 ns 155.597 ns 0.08
ByteFromMemoryManagerOfByte baseline (27008) 8,816.3 ns 175.853 ns 164.493 ns 1.00
ByteFromMemoryManagerOfByte w/ changes 6,569.5 ns 35.088 ns 31.105 ns 0.75
GetSpanFromEmptyMemory baseline (27008) 1,207.4 ns 22.214 ns 19.693 ns 1.00
GetSpanFromEmptyMemory w/ changes 887.2 ns 4.268 ns 3.784 ns 0.74

Various optimizations include:

  • We can rely on the Memory<T> ctor and factory methods to ensure that an object of an unexpected type never makes it in to the _object backing field, even in the face of a torn struct. This allows us to use unsafe code, bypassing the runtime type checks, but still requires that we perform bounds checks as appropriate.

  • Single exit from the Span property getter allows the JIT to optimize stack usage and minimize unnecessary data copying.

  • Since we're part of coreclr, we can use deep knowledge of runtime object layout and method table layout to further skip some type checks. This is particularly useful in the case where an array is the backing store.

  • The Length property getter is once again just a simple field access with no bitmask logic.

  • The Span property getter's code gen size shrinks by around 60%.

Unit tests will not pass until the corresponding corefx change comes online since the unit tests perform private reflection over the Memory<T> backing fields.

I realize that this may be contentious, especially considering the fact that this proposes probing into the internals of how objects are represented within the runtime. But coreclr code is uniquely positioned to take advantage of these implementation details since the managed code can evolve in sync with any VM changes.

I'm also open to any recommendations people might have for real-world benchmarks so that we can test whether these changes are actually useful in practice.

@GrabYourPitchforks
Copy link
Member Author

The GetHashCode changes in the latest iteration aren't needed since we should never be able to construct a Memory<T> with a null backing object and a non-zero index / length. I'll revert them.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Oct 23, 2018

I have a pending PR in corefx (dotnet/corefx#32994) to react to the unit test breaks that will occur when this PR goes through.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll defer to @jkotas on whether we're comfortable with this level of internals reliance, but otherwise, LGTM.

@jkotas
Copy link
Member

jkotas commented Oct 25, 2018

I am ok with depending on internals like this as long as it is within CoreLib.

I am a bit worried about subtle GC holes that GetObjectMethodTablePointer may introduce in certain situations.


return Unsafe.Add(ref Unsafe.As<byte, IntPtr>(ref JitHelpers.GetPinningHelper(obj).m_data), -1);

// Ideally this method would be replaced by the VM with:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we would ever want this. This would mean hacking the JIT to allow derefencing objectrefs as pointers. It is invalid IL and it should stay as invalid IL.

I think what we would want here is to make methodtable pointer to be a field and fetch the field here, similar to how it is done in CoreRT.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an experiment, I tried writing an [Intrinsic] method that was VM-replaced by the mentioned IL sequence. The release VM and JIT handled it just fine (and produced the desired codegen), but the checked JIT hit an assert and failed. I'm sure other parts of the JIT would get angry if I tried to force this through. :)

Really all this gets us in the end is that we eliminate a single instruction. It allows us to turn this (the codegen from this PR):

lea rax, [rbx + 8] ; rbx = obj ref, rax = ref obj._firstField
mov rax, qword ptr [rax - 8] ; rax = pMethodTable
; .. remainder of logic ..

Into this (note no lea instruction or SIB syntax):

mov rax, qword ptr [rbx] ; rbx = obj ref, rax = pMethodTable
; .. remainder of logic ..

A single instruction probably isn't worth the complexity. I was just obsessing over the code gen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the checked JIT hit an assert and failed. I'm sure other parts of the JIT would get angry if I tried to force this through. :)

Right, it is invalid IL and it should stay invalid IL. The comment that says we would want to ideally replace it by ldarg.0 + ldind.i should be deleted.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, unless someone's already done the analysis and determined that it would be difficult for some fundamental reason for the JIT to generate this directly, you could file an issue so that it could be investigated.

Copy link
Member

@jkotas jkotas Nov 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The right fix for this should be on the VM side, not in the JIT. To make this work, VM should feed the following IL instruction to the JIT: ldarg.0 + ldfld m_pMethodTable.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas - But would it be incorrect for the JIT to fold this? I know that we can get into trouble if we incorrectly optimize address expressions to produce interim byref results that don't point into an object, but it would seem that folding in this case could never be invalid - perhaps I'm missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But would it be incorrect for the JIT to fold this?

Yes, folding the two instructions in the JIT should be fine too.

@@ -194,6 +196,45 @@ public static bool IsReferenceOrContainsReferences<T>()
// See getILIntrinsicImplementation for how this happens.
throw new InvalidOperationException();
}

// Returns true iff the object has a component size;
// i.e., is variable length like string, array, Utf8String.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We do not have Utf8String and it is unclear whether we will ever have it. It may be a bit premature to start mentioning it in comments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good eye - this was an oversight on my part. This code was copied from the feature/utf8string branch and I forgot to clean up the comments. Will fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Utf8String still mentioned

@danmoseley
Copy link
Member

Cc @tannergooding @eerhardt

Nice to see this as this has come up as a problem in ML.NET when they tried to use Span more.

@GrabYourPitchforks
Copy link
Member Author

@jkotas You mentioned a potential GC hole in GetObjectMethodTablePointer. Can you elaborate? I believe this should be GC-safe, since getting the ref int to the first field will be an interior managed pointer back into the object, which is GC-tracked. If the object doesn't have any data fields, then the managed pointer should point just past the end of the object, which the GC treats as equivalently to an interior pointer into the object itself. Then backing up by sizeof(void*) should backtrack no earlier than the original object reference pointer, which is still GC-tracked.

One of my assumptions above may be incorrect. If so we can change the method to be more reliable.

@AndyAyersMS
Copy link
Member

Since you are modelling the resulting object pointer as an IntPtr there is a potential window of vulnerability where the object might move between the time you subtract to create the object pointer and the time you read from the pointer. An actual object reference will get updated but your IntPtr version won't be.

You might get lucky if the jit does not generate fully interruptible GC, or if the jit folds the subtract into an address mode for the read so that this intermediate pointer never exists in a register (or, as in your example above, folds both the first field offset and the subtract to give a net offset of zero) -- but there's no guarantee these will happen.

@jkotas
Copy link
Member

jkotas commented Oct 25, 2018

Tracked byref that points at offset 0 in the object does not show up anywhere else in the system today.

We had problems with this when we started using RyuJIT for CoreRT because of CoreRT had the helper to fetch the object methodtable since forever. We have patched all places where it was blowing up in RyuJIT with asserts, but that does not mean that everything works correctly. In fact, we know that CoreRT has reliability problems that we were not able to trace down - this can very well be one of them.

@GrabYourPitchforks
Copy link
Member Author

@AndyAyersMS We should end up with a managed IntPtr& (which happens to point to offset 0 inside some object) rather than a regular IntPtr. I'm under the impression that the GC should track this as it does any other byref, but Jan's making me question my reality now. 🤔

@jkotas Do you think this is a potential reliability issue that might scuttle this PR? I can perform a benchmark wherein we first pin the object, but I was trying my absolute hardest to avoid anything that might stack-spill, including pinning.

@jkotas
Copy link
Member

jkotas commented Oct 25, 2018

I'm under the impression that the GC should track this as it does any other byref

You are gettting lucky because of we have syncblock index in front of each object. It makes the system to associate byref pointing at offset zero with he current object as side-effect (most of the time at least - modulo corner case bugs :-). If we did not have syncblocks in front of each object, byref that points at offset zero would have to be associated with end of previous object per ECMA spec.

@jkotas Do you think this is a potential reliability issue that might scuttle this PR?

I am just saying that somebody may end up spending several weeks to trace down the reliability issue if there is one. If you do due-diligence to check for the potential bugs here, I do not have problem with this change. It may be a good idea to run some stress flavors on this.

@GrabYourPitchforks
Copy link
Member Author

FYI I spoke with Maoni about this offline and she indicated that having an interior pointer to the "zero" offset of an object is valid as far as GC reporting / tracking is concerned. We might still need to stress test this since as Jan suggested it's probably not a well exercised condition.

@jkotas
Copy link
Member

jkotas commented Nov 2, 2018

Unit tests will not pass until the corresponding corefx change comes online since the unit tests perform

The outdated tests should be disabled in https://github.com/dotnet/coreclr/blob/master/tests/CoreFX/CoreFX.issues.json to make the CI green.

@GrabYourPitchforks GrabYourPitchforks changed the title [WIP] Improve performance of Memory<T>.Span property getter Improve performance of Memory<T>.Span property getter Nov 2, 2018
@GrabYourPitchforks
Copy link
Member Author

Rebased PR on top of latest master to pick up Jan's earlier changes re: suppressing failing unit tests.

@@ -407,18 +454,7 @@ public bool Equals(Memory<T> other)
[EditorBrowsable(EditorBrowsableState.Never)]
public override int GetHashCode()
{
return _object != null ? CombineHashCodes(_object.GetHashCode(), _index.GetHashCode(), _length.GetHashCode()) : 0;
return (_object != null) ? ReadOnlyMemory<T>.CombineHashCodes(RuntimeHelpers.GetHashCode(_object), _index, _length) : 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use HashCode.Combine ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashCode.Combine is randomized because it pessimistically assumes its inputs are attacker-controlled. The Memory<T> object doesn't really qualify as attacker-controlled, so it can use a more performance-oriented routine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashCode.Combine also mixes the bits well to get better hashcode distribution than trivial combine functions.

It depends on whether anybody will ever use Memory.GetHashCode/Equals for anything real. These methods do not seem to be very useful...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, especially since the default implementation is referential equality. We can make them use the slower HashCode helpers for now for simplicity, and it leaves the door open for us to make targeted perf improvements to these methods in the future if the need arises.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo a few nits. Thanks!

@GrabYourPitchforks
Copy link
Member Author

@dotnet/jit-contrib There are some stress failures below. Are there any concerns with the failures noted here?

https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_windows_nt_corefx_jitstress1_prtest/7/
https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_windows_nt_gcstress0x3_prtest/25/

The stress failures don't appear to be related to the code in this PR. (One of the failures is a simple unit test failure for Memory<T> - we already know that test needs to be updated in corefx.)

@GrabYourPitchforks
Copy link
Member Author

Ping to @dotnet/jit-contrib, wondering if the stress failures noted above are known failures. They don't seem related to this particular code path. (See comment at #20386 (comment).)

- GetHashCode should always take all three fields into consideration without short-circuiting since Equals does the same
- Removed duplicate helper methods from Mem<T>, changing the callers to use the existing helper methods on ROM<T>
@GrabYourPitchforks
Copy link
Member Author

Rebased on top of latest master due to last week's Memory / ReadOnlyMemory changes causing conflicts. Latest iteration also addresses the remaining comment cleanup, further improves Span property getter performance (using the slice trick we committed last week), and changes GetHashCode to be simpler but subject to a higher performance hit.

@GrabYourPitchforks
Copy link
Member Author

Heard from @tannergooding that the stress test is a known failure. The particular failing test case takes so long to run that it hits a timeout when running under GC stress mode.

@GrabYourPitchforks GrabYourPitchforks merged commit ef93a72 into dotnet:master Nov 6, 2018
@GrabYourPitchforks GrabYourPitchforks deleted the memoryspan_perf branch November 10, 2018 20:32
@@ -486,6 +486,14 @@
{
"name": "System.Buffers.Text.Tests.FormatterTests.TestFormatterDecimal",
"reason": "https://github.com/dotnet/coreclr/pull/19775"
},
{
"name": "System.SpanTests.MemoryMarshalTests.CreateFromPinnedArrayIntSliceRemainsPinned",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GrabYourPitchforks, can we revert this change now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - we are still using the old copy of the tests. We will get the CoreFX snapshot updated once the migration to Azure DevOps is complete.

A-And pushed a commit to A-And/coreclr that referenced this pull request Nov 20, 2018
- We can use our knowledge of object representation in the runtime to speed up type checks.
- We leave the ref T and the length deconstructed until the very end, optimizing register usage.
- The Length property getter is once again just a simple field accessor with no bitwise logic.
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…#20386)

- We can use our knowledge of object representation in the runtime to speed up type checks.
- We leave the ref T and the length deconstructed until the very end, optimizing register usage.
- The Length property getter is once again just a simple field accessor with no bitwise logic.

Commit migrated from dotnet/coreclr@ef93a72
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet