Improve performance of Memory<T>.Span property getter #20386
Improve performance of Memory<T>.Span property getter #20386
Conversation
6cf9325
to
6d04254
Compare
The GetHashCode changes in the latest iteration aren't needed since we should never be able to construct a |
I have a pending PR in corefx (dotnet/corefx#32994) to react to the unit test breaks that will occur when this PR goes through. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll defer to @jkotas on whether we're comfortable with this level of internals reliance, but otherwise, LGTM.
I am ok with depending on internals like this as long as it is within CoreLib. I am a bit worried about subtle GC holes that |
|
||
return Unsafe.Add(ref Unsafe.As<byte, IntPtr>(ref JitHelpers.GetPinningHelper(obj).m_data), -1); | ||
|
||
// Ideally this method would be replaced by the VM with: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we would ever want this. This would mean hacking the JIT to allow derefencing objectrefs as pointers. It is invalid IL and it should stay as invalid IL.
I think what we would want here is to make methodtable pointer to be a field and fetch the field here, similar to how it is done in CoreRT.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an experiment, I tried writing an [Intrinsic]
method that was VM-replaced by the mentioned IL sequence. The release VM and JIT handled it just fine (and produced the desired codegen), but the checked JIT hit an assert and failed. I'm sure other parts of the JIT would get angry if I tried to force this through. :)
Really all this gets us in the end is that we eliminate a single instruction. It allows us to turn this (the codegen from this PR):
lea rax, [rbx + 8] ; rbx = obj ref, rax = ref obj._firstField
mov rax, qword ptr [rax - 8] ; rax = pMethodTable
; .. remainder of logic ..
Into this (note no lea
instruction or SIB syntax):
mov rax, qword ptr [rbx] ; rbx = obj ref, rax = pMethodTable
; .. remainder of logic ..
A single instruction probably isn't worth the complexity. I was just obsessing over the code gen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the checked JIT hit an assert and failed. I'm sure other parts of the JIT would get angry if I tried to force this through. :)
Right, it is invalid IL and it should stay invalid IL. The comment that says we would want to ideally replace it by ldarg.0
+ ldind.i
should be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, unless someone's already done the analysis and determined that it would be difficult for some fundamental reason for the JIT to generate this directly, you could file an issue so that it could be investigated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The right fix for this should be on the VM side, not in the JIT. To make this work, VM should feed the following IL instruction to the JIT: ldarg.0
+ ldfld m_pMethodTable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas - But would it be incorrect for the JIT to fold this? I know that we can get into trouble if we incorrectly optimize address expressions to produce interim byref results that don't point into an object, but it would seem that folding in this case could never be invalid - perhaps I'm missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But would it be incorrect for the JIT to fold this?
Yes, folding the two instructions in the JIT should be fine too.
@@ -194,6 +196,45 @@ public static bool IsReferenceOrContainsReferences<T>() | |||
// See getILIntrinsicImplementation for how this happens. | |||
throw new InvalidOperationException(); | |||
} | |||
|
|||
// Returns true iff the object has a component size; | |||
// i.e., is variable length like string, array, Utf8String. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: We do not have Utf8String
and it is unclear whether we will ever have it. It may be a bit premature to start mentioning it in comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good eye - this was an oversight on my part. This code was copied from the feature/utf8string branch and I forgot to clean up the comments. Will fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Utf8String still mentioned
Nice to see this as this has come up as a problem in ML.NET when they tried to use Span more. |
@jkotas You mentioned a potential GC hole in One of my assumptions above may be incorrect. If so we can change the method to be more reliable. |
Since you are modelling the resulting object pointer as an You might get lucky if the jit does not generate fully interruptible GC, or if the jit folds the subtract into an address mode for the read so that this intermediate pointer never exists in a register (or, as in your example above, folds both the first field offset and the subtract to give a net offset of zero) -- but there's no guarantee these will happen. |
Tracked byref that points at offset 0 in the object does not show up anywhere else in the system today. We had problems with this when we started using RyuJIT for CoreRT because of CoreRT had the helper to fetch the object methodtable since forever. We have patched all places where it was blowing up in RyuJIT with asserts, but that does not mean that everything works correctly. In fact, we know that CoreRT has reliability problems that we were not able to trace down - this can very well be one of them. |
@AndyAyersMS We should end up with a managed IntPtr& (which happens to point to offset 0 inside some object) rather than a regular IntPtr. I'm under the impression that the GC should track this as it does any other byref, but Jan's making me question my reality now. 🤔 @jkotas Do you think this is a potential reliability issue that might scuttle this PR? I can perform a benchmark wherein we first pin the object, but I was trying my absolute hardest to avoid anything that might stack-spill, including pinning. |
You are gettting lucky because of we have syncblock index in front of each object. It makes the system to associate
I am just saying that somebody may end up spending several weeks to trace down the reliability issue if there is one. If you do due-diligence to check for the potential bugs here, I do not have problem with this change. It may be a good idea to run some stress flavors on this. |
FYI I spoke with Maoni about this offline and she indicated that having an interior pointer to the "zero" offset of an object is valid as far as GC reporting / tracking is concerned. We might still need to stress test this since as Jan suggested it's probably not a well exercised condition. |
The outdated tests should be disabled in https://github.com/dotnet/coreclr/blob/master/tests/CoreFX/CoreFX.issues.json to make the CI green. |
src/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs
Outdated
Show resolved
Hide resolved
eeec04e
to
3066bfd
Compare
Rebased PR on top of latest master to pick up Jan's earlier changes re: suppressing failing unit tests. |
src/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.cs
Show resolved
Hide resolved
@@ -407,18 +454,7 @@ public bool Equals(Memory<T> other) | |||
[EditorBrowsable(EditorBrowsableState.Never)] | |||
public override int GetHashCode() | |||
{ | |||
return _object != null ? CombineHashCodes(_object.GetHashCode(), _index.GetHashCode(), _length.GetHashCode()) : 0; | |||
return (_object != null) ? ReadOnlyMemory<T>.CombineHashCodes(RuntimeHelpers.GetHashCode(_object), _index, _length) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use HashCode.Combine
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashCode.Combine
is randomized because it pessimistically assumes its inputs are attacker-controlled. The Memory<T>
object doesn't really qualify as attacker-controlled, so it can use a more performance-oriented routine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashCode.Combine
also mixes the bits well to get better hashcode distribution than trivial combine functions.
It depends on whether anybody will ever use Memory.GetHashCode/Equals
for anything real. These methods do not seem to be very useful...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, especially since the default implementation is referential equality. We can make them use the slower HashCode
helpers for now for simplicity, and it leaves the door open for us to make targeted perf improvements to these methods in the future if the need arises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo a few nits. Thanks!
@dotnet/jit-contrib There are some stress failures below. Are there any concerns with the failures noted here? https://ci.dot.net/job/dotnet_coreclr/job/master/job/jitstress/job/x64_checked_windows_nt_corefx_jitstress1_prtest/7/ The stress failures don't appear to be related to the code in this PR. (One of the failures is a simple unit test failure for |
Ping to @dotnet/jit-contrib, wondering if the stress failures noted above are known failures. They don't seem related to this particular code path. (See comment at #20386 (comment).) |
- GetHashCode should always take all three fields into consideration without short-circuiting since Equals does the same - Removed duplicate helper methods from Mem<T>, changing the callers to use the existing helper methods on ROM<T>
3066bfd
to
d3a9cf0
Compare
Rebased on top of latest master due to last week's |
Heard from @tannergooding that the stress test is a known failure. The particular failing test case takes so long to run that it hits a timeout when running under GC stress mode. |
@@ -486,6 +486,14 @@ | |||
{ | |||
"name": "System.Buffers.Text.Tests.FormatterTests.TestFormatterDecimal", | |||
"reason": "https://github.com/dotnet/coreclr/pull/19775" | |||
}, | |||
{ | |||
"name": "System.SpanTests.MemoryMarshalTests.CreateFromPinnedArrayIntSliceRemainsPinned", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GrabYourPitchforks, can we revert this change now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - we are still using the old copy of the tests. We will get the CoreFX snapshot updated once the migration to Azure DevOps is complete.
- We can use our knowledge of object representation in the runtime to speed up type checks. - We leave the ref T and the length deconstructed until the very end, optimizing register usage. - The Length property getter is once again just a simple field accessor with no bitwise logic.
…#20386) - We can use our knowledge of object representation in the runtime to speed up type checks. - We leave the ref T and the length deconstructed until the very end, optimizing register usage. - The Length property getter is once again just a simple field accessor with no bitwise logic. Commit migrated from dotnet/coreclr@ef93a72
Perf results for accessing the
Memory<T>.Span
property getter given different backing objects:Various optimizations include:
We can rely on the
Memory<T>
ctor and factory methods to ensure that an object of an unexpected type never makes it in to the_object
backing field, even in the face of a torn struct. This allows us to use unsafe code, bypassing the runtime type checks, but still requires that we perform bounds checks as appropriate.Single exit from the Span property getter allows the JIT to optimize stack usage and minimize unnecessary data copying.
Since we're part of coreclr, we can use deep knowledge of runtime object layout and method table layout to further skip some type checks. This is particularly useful in the case where an array is the backing store.
The Length property getter is once again just a simple field access with no bitmask logic.
The Span property getter's code gen size shrinks by around 60%.
Unit tests will not pass until the corresponding corefx change comes online since the unit tests perform private reflection over the
Memory<T>
backing fields.I realize that this may be contentious, especially considering the fact that this proposes probing into the internals of how objects are represented within the runtime. But coreclr code is uniquely positioned to take advantage of these implementation details since the managed code can evolve in sync with any VM changes.
I'm also open to any recommendations people might have for real-world benchmarks so that we can test whether these changes are actually useful in practice.