Improve performance of Memory<T>.Span property getter #20386

GrabYourPitchforks · 2018-10-12T01:42:49Z

Perf results for accessing the Memory<T>.Span property getter given different backing objects:

Method	Toolchain	Mean	Error	StdDev	Scaled
CharFromString	baseline (27008)	3,360.8 ns	31.849 ns	28.233 ns	1.00
CharFromString	w/ changes	2,946.9 ns	24.783 ns	21.969 ns	0.88

CharFromArrayOfChar	baseline (27008)	9,092.7 ns	156.120 ns	146.035 ns	1.00
CharFromArrayOfChar	w/ changes	3,551.1 ns	40.715 ns	36.093 ns	0.39

CharFromMemoryManagerOfChar	baseline (27008)	9,088.1 ns	83.590 ns	78.190 ns	1.00
CharFromMemoryManagerOfChar	w/ changes	6,460.6 ns	36.038 ns	28.136 ns	0.71

ByteFromArrayOfByte	baseline (27008)	8,479.8 ns	219.163 ns	277.172 ns	1.00
ByteFromArrayOfByte	w/ changes	3,233.9 ns	30.808 ns	27.310 ns	0.38

ByteFromArrayOfSByte	baseline (27008)	42,603.5 ns	685.866 ns	641.559 ns	1.00
ByteFromArrayOfSByte	w/ changes	3,381.6 ns	71.574 ns	155.597 ns	0.08

ByteFromMemoryManagerOfByte	baseline (27008)	8,816.3 ns	175.853 ns	164.493 ns	1.00
ByteFromMemoryManagerOfByte	w/ changes	6,569.5 ns	35.088 ns	31.105 ns	0.75

GetSpanFromEmptyMemory	baseline (27008)	1,207.4 ns	22.214 ns	19.693 ns	1.00
GetSpanFromEmptyMemory	w/ changes	887.2 ns	4.268 ns	3.784 ns	0.74
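To make the benchmark names above concrete, here is a minimal sketch of how a Memory<T> can end up with each kind of backing object. It is not the benchmark code behind the table (that code is not shown in this thread). `ArrayMemoryManager<T>` and `BackingObjects` are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Buffers;

// Hypothetical MemoryManager<T> that simply wraps an array (illustration only).
sealed class ArrayMemoryManager<T> : MemoryManager<T>
{
    private readonly T[] _array;
    public ArrayMemoryManager(T[] array) => _array = array;
    public override Span<T> GetSpan() => _array;
    public override MemoryHandle Pin(int elementIndex = 0) => throw new NotSupportedException();
    public override void Unpin() { }
    protected override void Dispose(bool disposing) { }
}

static class BackingObjects
{
    // Backed by a string (the CharFromString case).
    public static ReadOnlyMemory<char> FromString() => "hello".AsMemory();

    // Backed by a T[] whose runtime type is exactly char[] (CharFromArrayOfChar).
    public static Memory<char> FromCharArray() => new char[] { 'h', 'i' };

    // Backed by a MemoryManager<T> (CharFromMemoryManagerOfChar).
    public static Memory<char> FromManager() => new ArrayMemoryManager<char>(new char[3]).Memory;

    // The CLR lets an sbyte[] be viewed as byte[], so the backing object's
    // runtime type differs from typeof(byte[]) (ByteFromArrayOfSByte).
    public static Memory<byte> FromSByteArray() => (byte[])(object)new sbyte[4];

    // No backing object at all (GetSpanFromEmptyMemory).
    public static Memory<byte> Empty() => default;
}
```

Calling `.Span` on each of these values exercises a different path through the getter, which is presumably why the table reports them separately.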

Various optimizations include:

- We can rely on the Memory<T> ctor and factory methods to ensure that an object of an unexpected type never makes it into the _object backing field, even in the face of a torn struct. This allows us to use unsafe code, bypassing the runtime type checks, but still requires that we perform bounds checks as appropriate.
- Single exit from the Span property getter allows the JIT to optimize stack usage and minimize unnecessary data copying.
- Since we're part of coreclr, we can use deep knowledge of runtime object layout and method table layout to further skip some type checks. This is particularly useful in the case where an array is the backing store.
- The Length property getter is once again just a simple field access with no bitmask logic.
- The Span property getter's code gen size shrinks by around 60%.
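The type checks mentioned in the list above exist because the getter has to work out which kind of object backs the memory. The same three-way distinction can be observed from outside CoreLib through the public MemoryMarshal APIs. This sketch uses ordinary, fully checked calls and is not the optimized code from this PR. `BackingKindSketch` and `Describe` are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;

static class BackingKindSketch
{
    // Reports which kind of object backs the given memory, using only public APIs.
    public static string Describe(ReadOnlyMemory<char> memory)
    {
        if (memory.IsEmpty)
            return "empty";

        // String-backed memory is only possible for ReadOnlyMemory<char>.
        if (MemoryMarshal.TryGetString(memory, out _, out _, out _))
            return "string";

        // Array-backed memory exposes its array segment.
        if (MemoryMarshal.TryGetArray(memory, out ArraySegment<char> segment))
            return "array";

        // Anything else is assumed here to be backed by a MemoryManager<char>.
        return "MemoryManager";
    }
}
```

For example, `BackingKindSketch.Describe("abc".AsMemory())` returns `"string"` and `BackingKindSketch.Describe(new char[2])` returns `"array"`.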

Unit tests will not pass until the corresponding corefx change comes online since the unit tests perform private reflection over the Memory<T> backing fields.

I realize that this may be contentious, especially considering the fact that this proposes probing into the internals of how objects are represented within the runtime. But coreclr code is uniquely positioned to take advantage of these implementation details since the managed code can evolve in sync with any VM changes.

I'm also open to any recommendations people might have for real-world benchmarks so that we can test whether these changes are actually useful in practice.

GrabYourPitchforks · 2018-10-23T21:58:32Z

The GetHashCode changes in the latest iteration aren't needed since we should never be able to construct a Memory<T> with a null backing object and a non-zero index / length. I'll revert them.

GrabYourPitchforks · 2018-10-23T23:15:46Z

I have a pending PR in corefx (dotnet/corefx#32994) to react to the unit test breaks that will occur when this PR goes through.

stephentoub

I'll defer to @jkotas on whether we're comfortable with this level of internals reliance, but otherwise, LGTM.

jkotas · 2018-10-25T15:05:42Z

I am ok with depending on internals like this as long as it is within CoreLib.

I am a bit worried about subtle GC holes that GetObjectMethodTablePointer may introduce in certain situations.

jkotas · 2018-10-25T15:08:40Z

src/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs

+
+            return Unsafe.Add(ref Unsafe.As<byte, IntPtr>(ref JitHelpers.GetPinningHelper(obj).m_data), -1);
+
+            // Ideally this method would be replaced by the VM with:


I do not think we would ever want this. This would mean hacking the JIT to allow dereferencing objectrefs as pointers. It is invalid IL and it should stay as invalid IL.

I think what we would want here is to make methodtable pointer to be a field and fetch the field here, similar to how it is done in CoreRT.

GrabYourPitchforks · 2018-10-25

As an experiment, I tried writing an [Intrinsic] method that was VM-replaced by the mentioned IL sequence. The release VM and JIT handled it just fine (and produced the desired codegen), but the checked JIT hit an assert and failed. I'm sure other parts of the JIT would get angry if I tried to force this through. :)

Really all this gets us in the end is that we eliminate a single instruction. It allows us to turn this (the codegen from this PR):

lea rax, [rbx + 8]             ; rbx = obj ref, rax = ref obj._firstField
mov rax, qword ptr [rax - 8]   ; rax = pMethodTable
; .. remainder of logic ..

Into this (note no lea instruction or SIB syntax):

mov rax, qword ptr [rbx]       ; rbx = obj ref, rax = pMethodTable
; .. remainder of logic ..

A single instruction probably isn't worth the complexity. I was just obsessing over the code gen.

jkotas · 2018-11-02

the checked JIT hit an assert and failed. I'm sure other parts of the JIT would get angry if I tried to force this through. :)

Right, it is invalid IL and it should stay invalid IL. The comment that says we would want to ideally replace it by ldarg.0 + ldind.i should be deleted.

CarolEidt · 2018-11-02

That said, unless someone's already done the analysis and determined that it would be difficult for some fundamental reason for the JIT to generate this directly, you could file an issue so that it could be investigated.

jkotas · 2018-11-02

The right fix for this should be on the VM side, not in the JIT. To make this work, the VM should feed the following IL instructions to the JIT: ldarg.0 + ldfld m_pMethodTable.

CarolEidt · 2018-11-02

@jkotas - But would it be incorrect for the JIT to fold this? I know that we can get into trouble if we incorrectly optimize address expressions to produce interim byref results that don't point into an object, but it would seem that folding in this case could never be invalid - perhaps I'm missing something?

jkotas · 2018-11-02

But would it be incorrect for the JIT to fold this?

Yes, folding the two instructions in the JIT should be fine too.
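For readers following the thread above, the helper under discussion can be approximated outside CoreLib roughly as follows. This is a sketch under loud assumptions: `RawData` is a hypothetical stand-in for the internal `JitHelpers.GetPinningHelper` type, the code assumes CoreCLR's object layout (method table pointer at offset 0, first field one pointer-sized slot later), and it briefly forms the same offset-zero byref that this review is concerned about. It is illustrative only and not something to ship.

```csharp
using System;
using System.Runtime.CompilerServices;

static class MethodTableSketch
{
    // Hypothetical stand-in for CoreLib's internal pinning helper: a class whose
    // first instance field is assumed to sit right after the method table pointer.
    private sealed class RawData
    {
        public byte Data;
    }

    // Illustration only: relies on undocumented CoreCLR object layout.
    public static IntPtr GetMethodTablePointer(object obj)
    {
        // Managed reference to the first field of the object (an interior byref).
        ref byte firstField = ref Unsafe.As<RawData>(obj).Data;

        // Step back one pointer-sized slot to offset 0 of the object and read it.
        return Unsafe.Add(ref Unsafe.As<byte, IntPtr>(ref firstField), -1);
    }
}
```

On CoreCLR the value returned for a string instance is expected to equal `typeof(string).TypeHandle.Value`, although that equality is itself an implementation detail rather than a documented guarantee.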

jkotas · 2018-10-25T15:09:40Z

src/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs

@@ -194,6 +196,45 @@ public static bool IsReferenceOrContainsReferences<T>()
            // See getILIntrinsicImplementation for how this happens.
            throw new InvalidOperationException();
        }
+
+        // Returns true iff the object has a component size;
+        // i.e., is variable length like string, array, Utf8String.


Nit: We do not have Utf8String and it is unclear whether we will ever have it. It may be a bit premature to start mentioning it in comments.

GrabYourPitchforks · 2018-10-25

Good eye - this was an oversight on my part. This code was copied from the feature/utf8string branch and I forgot to clean up the comments. Will fix.

jkotas · 2018-11-02

Utf8String still mentioned

danmoseley · 2018-10-25T15:48:57Z

Cc @tannergooding @eerhardt

Nice to see this as this has come up as a problem in ML.NET when they tried to use Span more.

GrabYourPitchforks · 2018-10-25T19:00:09Z

@jkotas You mentioned a potential GC hole in GetObjectMethodTablePointer. Can you elaborate? I believe this should be GC-safe, since getting the ref int to the first field will be an interior managed pointer back into the object, which is GC-tracked. If the object doesn't have any data fields, then the managed pointer should point just past the end of the object, which the GC treats equivalently to an interior pointer into the object itself. Then backing up by sizeof(void*) should backtrack no earlier than the original object reference pointer, which is still GC-tracked.

One of my assumptions above may be incorrect. If so we can change the method to be more reliable.

AndyAyersMS · 2018-10-25T19:23:57Z

Since you are modelling the resulting object pointer as an IntPtr there is a potential window of vulnerability where the object might move between the time you subtract to create the object pointer and the time you read from the pointer. An actual object reference will get updated but your IntPtr version won't be.

You might get lucky if the jit does not generate fully interruptible GC, or if the jit folds the subtract into an address mode for the read so that this intermediate pointer never exists in a register (or, as in your example above, folds both the first field offset and the subtract to give a net offset of zero) -- but there's no guarantee these will happen.

jkotas · 2018-10-25T19:27:21Z

Tracked byref that points at offset 0 in the object does not show up anywhere else in the system today.

We had problems with this when we started using RyuJIT for CoreRT because CoreRT has had the helper to fetch the object methodtable since forever. We have patched all places where it was blowing up in RyuJIT with asserts, but that does not mean that everything works correctly. In fact, we know that CoreRT has reliability problems that we were not able to trace down - this can very well be one of them.

GrabYourPitchforks · 2018-10-25T20:33:32Z

@AndyAyersMS We should end up with a managed IntPtr& (which happens to point to offset 0 inside some object) rather than a regular IntPtr. I'm under the impression that the GC should track this as it does any other byref, but Jan's making me question my reality now. 🤔

@jkotas Do you think this is a potential reliability issue that might scuttle this PR? I can perform a benchmark wherein we first pin the object, but I was trying my absolute hardest to avoid anything that might stack-spill, including pinning.

jkotas · 2018-10-25T21:06:18Z

I'm under the impression that the GC should track this as it does any other byref

You are getting lucky because we have a syncblock index in front of each object. As a side effect, it makes the system associate a byref pointing at offset zero with the current object (most of the time at least - modulo corner case bugs :-). If we did not have syncblocks in front of each object, a byref that points at offset zero would have to be associated with the end of the previous object per the ECMA spec.

@jkotas Do you think this is a potential reliability issue that might scuttle this PR?

I am just saying that somebody may end up spending several weeks to trace down the reliability issue if there is one. If you do due diligence to check for the potential bugs here, I do not have a problem with this change. It may be a good idea to run some stress flavors on this.

GrabYourPitchforks · 2018-11-01T08:08:14Z

FYI I spoke with Maoni about this offline and she indicated that having an interior pointer to the "zero" offset of an object is valid as far as GC reporting / tracking is concerned. We might still need to stress test this since as Jan suggested it's probably not a well exercised condition.

jkotas · 2018-11-02T00:49:56Z

Unit tests will not pass until the corresponding corefx change comes online since the unit tests perform

The outdated tests should be disabled in https://github.com/dotnet/coreclr/blob/master/tests/CoreFX/CoreFX.issues.json to make the CI green.

GrabYourPitchforks · 2018-11-02T05:44:08Z

Rebased PR on top of latest master to pick up Jan's earlier changes re: suppressing failing unit tests.

jkotas · 2018-11-02T18:15:29Z

src/System.Private.CoreLib/shared/System/Memory.cs

@@ -407,18 +454,7 @@ public bool Equals(Memory<T> other)
        [EditorBrowsable(EditorBrowsableState.Never)]
        public override int GetHashCode()
        {
-            return _object != null ? CombineHashCodes(_object.GetHashCode(), _index.GetHashCode(), _length.GetHashCode()) : 0;
+            return (_object != null) ? ReadOnlyMemory<T>.CombineHashCodes(RuntimeHelpers.GetHashCode(_object), _index, _length) : 0;


Why not just use HashCode.Combine?

GrabYourPitchforks · 2018-11-02

HashCode.Combine is randomized because it pessimistically assumes its inputs are attacker-controlled. The Memory<T> object doesn't really qualify as attacker-controlled, so it can use a more performance-oriented routine.

jkotas · 2018-11-02

HashCode.Combine also mixes the bits well to get better hashcode distribution than trivial combine functions.

It depends on whether anybody will ever use Memory.GetHashCode/Equals for anything real. These methods do not seem to be very useful...

GrabYourPitchforks · 2018-11-05

That's a good point, especially since the default implementation is referential equality. We can make them use the slower HashCode helpers for now for simplicity, and it leaves the door open for us to make targeted perf improvements to these methods in the future if the need arises.
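The trade-off discussed in this thread can be sketched as follows. `TrivialCombine` is a hypothetical rotate-and-xor combiner written for illustration; it is not the `CombineHashCodes` helper from the PR, whose body is not shown here. `WithHashCodeCombine` uses the public `HashCode.Combine` API, which is seeded per process and mixes bits more thoroughly.

```csharp
using System;
using System.Runtime.CompilerServices;

static class MemoryHashSketch
{
    // Rotate left by 5 bits, then xor in the next value (hypothetical, illustration only).
    private static int Mix(int h1, int h2)
    {
        uint rotated = ((uint)h1 << 5) | ((uint)h1 >> 27);
        return (int)rotated ^ h2;
    }

    // Cheap and deterministic, but with weaker bit distribution.
    public static int TrivialCombine(int h1, int h2, int h3) => Mix(Mix(h1, h2), h3);

    // Slower, randomized per process, better distribution.
    public static int WithHashCodeCombine(object backing, int index, int length)
        => HashCode.Combine(RuntimeHelpers.GetHashCode(backing), index, length);
}
```

Because `HashCode.Combine` is seeded per process, its results are stable only within a single run, whereas the trivial combiner gives the same value for the same inputs everywhere.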

jkotas

LGTM modulo a few nits. Thanks!

GrabYourPitchforks · 2018-11-02T18:24:12Z

@dotnet/jit-contrib There are some stress failures below. Are there any concerns with the failures noted here?

https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_windows_nt_corefx_jitstress1_prtest/7/
https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_windows_nt_gcstress0x3_prtest/25/

The stress failures don't appear to be related to the code in this PR. (One of the failures is a simple unit test failure for Memory<T> - we already know that test needs to be updated in corefx.)

GrabYourPitchforks · 2018-11-05T18:59:49Z

Ping to @dotnet/jit-contrib, wondering if the stress failures noted above are known failures. They don't seem related to this particular code path. (See comment at #20386 (comment).)

GrabYourPitchforks · 2018-11-06T00:41:45Z

Rebased on top of latest master due to last week's Memory / ReadOnlyMemory changes causing conflicts. Latest iteration also addresses the remaining comment cleanup, further improves Span property getter performance (using the slice trick we committed last week), and changes GetHashCode to be simpler but subject to a higher performance hit.

GrabYourPitchforks · 2018-11-06T17:38:57Z

Heard from @tannergooding that the stress test is a known failure. The particular failing test case takes so long to run that it hits a timeout when running under GC stress mode.

ahsonkhan · 2018-11-16T04:05:26Z

tests/CoreFX/CoreFX.issues.json

@@ -486,6 +486,14 @@
                {
                    "name": "System.Buffers.Text.Tests.FormatterTests.TestFormatterDecimal",
                    "reason": "https://github.com/dotnet/coreclr/pull/19775"
+                },
+                {
+                    "name": "System.SpanTests.MemoryMarshalTests.CreateFromPinnedArrayIntSliceRemainsPinned",


@GrabYourPitchforks, can we revert this change now?

jkotas · 2018-11-16

No - we are still using the old copy of the tests. We will get the CoreFX snapshot updated once the migration to Azure DevOps is complete.

…#20386)

- We can use our knowledge of object representation in the runtime to speed up type checks.
- We leave the ref T and the length deconstructed until the very end, optimizing register usage.
- The Length property getter is once again just a simple field accessor with no bitwise logic.

Commit migrated from dotnet/coreclr@ef93a72

ahsonkhan added the area-System.Memory label Oct 12, 2018

benaadams reviewed Oct 13, 2018

src/System.Private.CoreLib/shared/System/Memory.cs

benaadams reviewed Oct 13, 2018

src/System.Private.CoreLib/shared/System/ReadOnlyMemory.cs

GrabYourPitchforks force-pushed the memoryspan_perf branch from 6cf9325 to 6d04254 Compare October 23, 2018 21:12

GrabYourPitchforks mentioned this pull request Oct 23, 2018

Update unit tests to account for Memory<T> changes dotnet/corefx#32994

Closed

ahsonkhan mentioned this pull request Oct 24, 2018

Optimize ReadOnlySequence.First for the common case dotnet/corefx#33000

Merged

stephentoub reviewed Oct 25, 2018

src/System.Private.CoreLib/shared/System/Memory.cs

stephentoub reviewed Oct 25, 2018

src/System.Private.CoreLib/shared/System/Memory.cs

stephentoub approved these changes Oct 25, 2018

jkotas reviewed Oct 25, 2018

jkotas reviewed Nov 2, 2018

src/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs (outdated)

GrabYourPitchforks changed the title ~~[WIP] Improve performance of Memory<T>.Span property getter~~ Improve performance of Memory<T>.Span property getter Nov 2, 2018

GrabYourPitchforks force-pushed the memoryspan_perf branch from eeec04e to 3066bfd Compare November 2, 2018 05:43

MarcoRossignoli reviewed Nov 2, 2018

src/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs

jkotas reviewed Nov 2, 2018

jkotas approved these changes Nov 2, 2018

stephentoub reviewed Nov 5, 2018

src/System.Private.CoreLib/shared/System/Memory.cs

GrabYourPitchforks added 7 commits November 5, 2018 16:20

Speed up Memory<T>.Span property getter

4870d86

Minor cleanup in Mem/ROM<T>

3008de8

- GetHashCode should always take all three fields into consideration without short-circuiting since Equals does the same
- Removed duplicate helper methods from Mem<T>, changing the callers to use the existing helper methods on ROM<T>

Revert unneeded GetHashCode changes

b1da55d

PR feedback, comment cleanup

61f2a2b

Temporarily suppress failing corefx tests

5d8c702

PR feedback, also use "fast slice" trick in Span getter

369288f

Whitespace cleanup

d3a9cf0

GrabYourPitchforks force-pushed the memoryspan_perf branch from 3066bfd to d3a9cf0 Compare November 6, 2018 00:40

Fix bad merge

c7ae7e4

GrabYourPitchforks merged commit ef93a72 into dotnet:master Nov 6, 2018

GrabYourPitchforks deleted the memoryspan_perf branch November 10, 2018 20:32

ahsonkhan reviewed Nov 16, 2018

buybackoff mentioned this pull request Jan 19, 2019

Vec<T> design Spreads/Spreads.Native#3

Closed

GrabYourPitchforks mentioned this pull request Feb 25, 2019

[WIP] Further improve performance of Memory<T>.Span property getter #22829

Closed

ahsonkhan mentioned this pull request Mar 3, 2019

Memory spec question dotnet/corefxlab#2650

Closed

GrabYourPitchforks mentioned this pull request Mar 13, 2019

Add Utf8String type #23209

Merged

omariom mentioned this pull request Jan 31, 2020

JIT should recognize a popular type check pattern dotnet/runtime#11396

Closed

buybackoff mentioned this pull request Feb 23, 2022

Read method without locals disruptor-net/Disruptor-net#65

Closed
