Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Improve BinaryReader.ReadUInt32() perf by 30% when using MemoryStream #22102

Merged
merged 7 commits into from Jan 28, 2019

Conversation

TomerWeisberg
Copy link

BinaryReader.ReadInt32() has nice optimization which was missing from the ReadUInt32() version. Now both implementations are aligned.

BinaryReader.ReadInt32() has nice optimization which was missing from the ReadUInt32() version. Now both implementations are aligned.
@dnfclas
Copy link

dnfclas commented Jan 20, 2019

CLA assistant check
All CLA requirements met.

@danmoseley
Copy link
Member

Do you have a bit more detail on your perf measurements?

@TomerWeisberg
Copy link
Author

      Do you have a bit more detail on your perf measurements?

I wrote sample app which load 50MB file to memory stream and than read the entire stream in a loop with both ReadInt32 and ReadUInt32. Before my change on average the Int32 version is 30% faster. After my change they are identical.

Also adding ReadGuid function.

ReadUInt32 and all the 64 bits types are now 30% to 50% faster than original implementation.
int mid = ((int)buffer[4]) | ((int)buffer[5] << 8) | ((int)buffer[6] << 16) | ((int)buffer[7] << 24);
int hi = ((int)buffer[8]) | ((int)buffer[9] << 8) | ((int)buffer[10] << 16) | ((int)buffer[11] << 24);
int flags = ((int)buffer[12]) | ((int)buffer[13] << 8) | ((int)buffer[14] << 16) | ((int)buffer[15] << 24);
Debug.Assert((span != null && span.Length >= 16), "[ToDecimal]buffer != null && buffer.Length >= 16");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The span != null should be removed: spans can't be null.

The assert string should also replace "buffer" with "span" and remove the corresponding "!= null" portion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

int lo = ((int)span[0]) | ((int)span[1] << 8) | ((int)span[2] << 16) | ((int)span[3] << 24);
int mid = ((int)span[4]) | ((int)span[5] << 8) | ((int)span[6] << 16) | ((int)span[7] << 24);
int hi = ((int)span[8]) | ((int)span[9] << 8) | ((int)span[10] << 16) | ((int)span[11] << 24);
int flags = ((int)span[12]) | ((int)span[13] << 8) | ((int)span[14] << 16) | ((int)span[15] << 24);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really relevant to your PR, but I'd be curious to know if the bounds checks on these span accesses are getting eliminated; presumably that would only happen if this gets inlined and if the JIT can see that the span is guaranteed to be at least 16 bytes long. If it's not, it'd be interesting to measure whether moving this int flags = … line to be the first line improved throughput, and similarly whether reversing the order in this line, such that the span[15] is the first access evaluated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you can change this to use BinaryPrimitives.ReadInt32LittleEndian to read the value from the buffer while you are on it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


public virtual byte ReadByte()
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private byte InternalReadByte()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this implementation end up being faster than just having ReadByte be => InternalRead(1)[0]?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this is also a potential breaking change because of Stream.ReadByte() is virtual. It won't be surprising if somebody took dependency on BinaryReader.ReadByte() calling Stream.ReadByte().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both original and my new implementation is calling Stream.ReadByte() from the Binaryreader.ReadByte() so I am not sure I understand this comment. I don't think I changed anything there. I did change ReadBoolean and ReadSByte to use the Strea,ReadByte(). I think it is more correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling ReadByte() should be definitely faster for properly implemented streams.

Calling ReadByte() instead of Read() is not bug-for-bug compatible and it can break for incorrectly implemented streams. I think it is ok to do this change in .NET Core. @stephentoub Do you agree?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only 2 functions I changed is ReadBoolean and ReadSByte. As ReadByte already called the Stream.ReadByte than I think the change here does make sense. If streams didn't implement ReadByte correctly than the BinaryReader.ReadByte was already broken even before my change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling ReadByte() instead of Read() is not bug-for-bug compatible and it can break for incorrectly implemented streams. I think it is ok to do this change in .NET Core. @stephentoub Do you agree?

From a compatibility standpoint I think it's ok. And as you say, for optimized streams it should be better. My concern is only that some streams don't override ReadByte, in which case they pick up the base implementation, which allocates a new byte[1] for each call, which will make it worse. On the balance, it's probably a fine change to make: all of our streams override ReadByte, and any stream that cared about performance would as well.

MemoryStream mStream = _stream as MemoryStream;
Debug.Assert(mStream != null, "_stream as MemoryStream != null");

return mStream.InternalReadSpan(numBytes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this would be a little better as:

Debug.Assert(_stream is MemoryStream);
return ((MemoryStream)_stream).InternalReadSpan(numBytes);

That way, if somehow we ended up with _stream being a non-null stream that was something else, we'd at least get a cast exception with details rather than a null reference exception, and if it does end up being null, we'll get the same null ref exception we would today.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea. fixed.

}
}

[Obsolete("This should not be part of the exposed surface of this class. It is also not used internally anymore.", true)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is a breaking change, both marking this as Obsolete and not using this method anymore, as a derived type expecting this to be used will be sorely disappointed.

@jkotas, were you aware of this issue and decided it was fine due to the impact likely being minimal? At a minimum I don't think we should mark it Obsolete, or we should at least run it through API review.

Copy link
Member

@jkotas jkotas Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is @TomerWeisberg 's change. I did not have anything like this in my snippet
jkotas@dc04600

Agree that this should stay is it was. I do not think there is a point in trying to obsolete it nor doing anything else about it to gain a bit of extra performance. The BinaryReader is not a top-performance abstraction by design (e.g. all methods are virtual, ...). If somebody wants the fastest reader, they should use something else.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I marked it as Obsolete to start this discussion. I thought it is not needed and if you will search for it in your favorite search engine you will see people asking why they need it and even some wrong assumptions about it. I don't see how derived type can safely use it without also overriding the Read* methods and basically create something completely custom. As the _buffer is private how can someone change this implementation in a way which will be useful?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine to start a discussion about issues like these, but the right place to do it is in a separate issue in corefx repo where we track all API additions, etc.

We prefer smaller PRs that are fixing one problem. They are easier to review and get through the system.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I removed the Obsolete attribute but I still have no clue when someone should call that function or override it. Can you explain this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a public extension point for 15+ years, people will find creative ways to use it.

I agree with you that this API is not designed well. We typically need to have a good reason to do something about poorly designed APIs (e.g. security risk or high rate of misuse), otherwise it is just not worth the pain to deal with the breaking change. We treat API removals or obsoleting the same way as API additions: File an issue with the proposal in corefx repo, e.g. https://github.com/dotnet/corefx/issues/33602 is an issue about obsoleting String.Copy.

int pos = (_position += 4); // use temp to avoid a race condition
if (pos > _length)
int origPos = _position;
int newPos = origPos + count;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should assert that count is > 0 && <= 16. If count was sufficiently large, this could overflow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. but as I wrote above, I would like to make it public and made this a real check and not assert. waiting for your feedback.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. but as I wrote above, I would like to make it public and made this a real check and not assert. waiting for your feedback.

Waiting for my feedback on making it public? If you want to expose additional APIs, you should open an issue for them per https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md.

}
public virtual ulong ReadUInt64() => BinaryPrimitives.ReadUInt64LittleEndian(InternalRead(8));

public virtual Guid ReadGuid() => new Guid(InternalRead(16));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no ReadGuid method before. Why are you adding one?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This turned out to be the biggest missing thing in the reader. As only the reader has access to the performant way of reading MemoryStream I think it is worthwhile adding it. If this is ok, I will also add it to the BinaryWriter for symmetry reasons.
What I would think is great improvement is to expose my new memoryStream.InternalReadSpan as public and than other implementors can gain access to speed reading a stream. The thing I wasn't sure about is why MemoryStream has the publiclyVisible property and should we block that function from returning a span when that thing is false. Can you comment about why publiclyVisible is needed and if exposing direct ReadOnlySpan is good idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process for adding new public APIs to .NET Core is documented here: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/api-review-process.md . If you think there should be more APIs on BinaryReader, start with filing an issue with the proposal in corefx repo.


public virtual unsafe float ReadSingle()
{
FillBuffer(4);
uint tmpBuffer = (uint)(_buffer[0] | _buffer[1] << 8 | _buffer[2] << 16 | _buffer[3] << 24);
ReadOnlySpan<byte> span = InternalRead(4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BitConverter.Int32BitsToSingle(BinaryPrimitives.ReadInt32LittleEndian(InternalRead(4)))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

_buffer[2] << 16 | _buffer[3] << 24);
uint hi = (uint)(_buffer[4] | _buffer[5] << 8 |
_buffer[6] << 16 | _buffer[7] << 24);
ReadOnlySpan<byte> span = InternalRead(8);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@@ -649,6 +578,43 @@ public virtual byte[] ReadBytes(int count)
return result;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AggressiveInlining should be the other way around: InternalReadSpan should be the method that should be marked with it in the first place since it is smaller and called just from a single place. This method should not be marked with it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


// Check validity of what we got
if (newPos < 0)
throw new IOException(SR.IO_StreamTooLong);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the right error message to throw here. If newPos is negative here, it means that the read was beyond the end of the stream.

It would be better to change the following check to:

if ((uint)newPos > (uint)_length)

It would cover the integer overflow case without any additional overhead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.


public virtual byte ReadByte()
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private byte InternalReadByte()
{
// Inlined to avoid some method call overhead with FillBuffer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment does not make sense now since FillBuffer is not ever called anywhere else.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

return _buffer;
}
}

protected virtual void FillBuffer(int numBytes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be useful to add a comment on top of FillBuffer that explains the history on why it is no longer called.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@@ -226,18 +226,24 @@ internal int InternalGetPosition()
return _position;
}

// PERF: Takes out Int32 as fast as possible
internal int InternalReadInt32()
// PERF: Takes out ReadOnlySpan as fast as possible
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I don't think this comment is necessary.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved.

Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Can you share final perf measurements of your change, with and without a memory stream, across the various operations being changed?

@TomerWeisberg
Copy link
Author

Performance results from before and after below. Please note that both runs where against private builds of core. one with my changes and one clean. The reason was that I wasnt sure if the official build get the exact same optimizations as what I get when building locally. Both were done against release build thought. The first run is the base line and the 2nd line from the _TW repo is my improved version.

Test is loading a 280MB file into memory and than reading all of it each time with different function.
Everything is the same / way better except of the Int32 versoin that is a tiny bit slower. This is as it is now using the span and not reading directly from the buffer. I think this is acceptable but open to your feedback. Old version both used direct read from buffer and didnt use the helper function to convert to Int.

Let me know what is the next step with getting this integrated to main repository,

Thanks, Tomer.

Starting on C:\Repos\coreclr\bin\Product\Windows_NT.x64.Release\corerun.exe
Ready to test
ReadByte: 293601280 reads. Took: 00:00:02.6801400
ReadSByte: 293601280 reads. Took: 00:00:03.6471306
ReadBoolean: 293601280 reads. Took: 00:00:03.7516095
ReadInt16: 146800640 reads. Took: 00:00:02.2176729
ReadUInt16: 146800640 reads. Took: 00:00:02.1801994
ReadInt32: 73400320 reads. Took: 00:00:00.7459486
ReadUInt32: 73400320 reads. Took: 00:00:01.2596391
ReadInt64: 36700160 reads. Took: 00:00:00.8546779
ReadUInt64: 36700160 reads. Took: 00:00:00.8521659
ReadSingle: 73400320 reads. Took: 00:00:01.3698878
ReadDouble: 36700160 reads. Took: 00:00:00.8798812

Starting on C:\Repos\coreclr_TW\bin\Product\Windows_NT.x64.Release\corerun.exe
Ready to test
ReadByte: 293601280 reads. Took: 00:00:02.5911419
ReadSByte: 293601280 reads. Took: 00:00:02.8700745
ReadBoolean: 293601280 reads. Took: 00:00:02.5418041
ReadInt16: 146800640 reads. Took: 00:00:01.4964327
ReadUInt16: 146800640 reads. Took: 00:00:01.4765707
ReadInt32: 73400320 reads. Took: 00:00:00.7503634
ReadUInt32: 73400320 reads. Took: 00:00:00.7280652
ReadInt64: 36700160 reads. Took: 00:00:00.3798425
ReadUInt64: 36700160 reads. Took: 00:00:00.3760978
ReadSingle: 73400320 reads. Took: 00:00:00.7661817
ReadDouble: 36700160 reads. Took: 00:00:00.3791665

@stephentoub
Copy link
Member

Test is loading a 280MB file into memory and than reading all of it each time with different function.

Thanks. And what about with a FileStream instead of MemoryStream, or with a custom MemoryStream-like type that doesn't derive from MemoryStream?

@TomerWeisberg
Copy link
Author

With file size 260KB. I think results are basically the same.

Starting on C:\Repos\coreclr\bin\Product\Windows_NT.x64.Release\corerun.exe
Ready to test
ReadByte: 266240 reads. Took: 00:00:00.6549561
ReadSByte: 266240 reads. Took: 00:00:00.6969616
ReadBoolean: 266240 reads. Took: 00:00:00.6470082
ReadInt16: 133120 reads. Took: 00:00:00.3158906
ReadUInt16: 133120 reads. Took: 00:00:00.3120279
ReadInt32: 66560 reads. Took: 00:00:00.1562266
ReadUInt32: 66560 reads. Took: 00:00:00.1627752
ReadInt64: 33280 reads. Took: 00:00:00.0782732
ReadUInt64: 33280 reads. Took: 00:00:00.0775452
ReadSingle: 66560 reads. Took: 00:00:00.1574001
ReadDouble: 33280 reads. Took: 00:00:00.0785195

Starting on C:\Repos\coreclr_TW\bin\Product\Windows_NT.x64.Release\corerun.exe
Ready to test
ReadByte: 266240 reads. Took: 00:00:00.6479899
ReadSByte: 266240 reads. Took: 00:00:00.6712671
ReadBoolean: 266240 reads. Took: 00:00:00.6329843
ReadInt16: 133120 reads. Took: 00:00:00.3197063
ReadUInt16: 133120 reads. Took: 00:00:00.3336740
ReadInt32: 66560 reads. Took: 00:00:00.1626926
ReadUInt32: 66560 reads. Took: 00:00:00.1613306
ReadInt64: 33280 reads. Took: 00:00:00.0786398
ReadUInt64: 33280 reads. Took: 00:00:00.0820464
ReadSingle: 66560 reads. Took: 00:00:00.1566169
ReadDouble: 33280 reads. Took: 00:00:00.0776651

@TomerWeisberg
Copy link
Author

I have idea how to improve the none MemoryStream perf a bit. Let me try something out.

@TomerWeisberg
Copy link
Author

Nope. I was wrong. This is as fast as I can make it. Still aligned with previous results so I think we are good,

}
}

// FillBuffer is not performing well when reading from MemoryStreams as it is using the public IStream interface.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no IStream interface in .NET.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


// FillBuffer is not performing well when reading from MemoryStreams as it is using the public IStream interface.
// We introduced new function InternalRead which can work directly on the MemoryStream internal buffer or using the public IStream
// interface when working with all other streams. This function is not needed anymore but we decided not to delete it for competability
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

competability -> compatibility

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@jkotas jkotas merged commit 87cc716 into dotnet:master Jan 28, 2019
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…dotnet/coreclr#22102)

BinaryReader.ReadInt32() has nice optimization which was missing from the ReadUInt32() version. Now both implementations are aligned.

ReadUInt32 and all the 64 bits types are now 30% to 50% faster than original implementation.

Commit migrated from dotnet/coreclr@87cc716
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
5 participants