Introduce and use Indexed BufferReader methods #402

Merged (12 commits) on Feb 6, 2024

@iamcarbon (Collaborator) commented Feb 5, 2024

This PR introduces indexed accessors on BufferReader and eliminates another batch of allocations.

Comment on lines +170 to 173
if (values.Length < 2 + 2)
{
    return null;
}
Collaborator Author

The length is validated upfront, so we don't have to worry about an IOException below.

Comment on lines +103 to 104
if (bytes.Length < 4 + 8)
    return null;
Collaborator Author

The length is validated upfront, to prevent an IOException later.
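
For illustration, the validate-first pattern in these two comments can be sketched as below. This is a minimal sketch, not code from the PR: the `HeaderParsing` class, the `TryReadHeader` name and the 4 + 8 byte layout (a big-endian tag followed by a big-endian value) are assumptions made for the example. Once the length check passes, the fixed-offset reads cannot run past the end of the data.

```csharp
using System;
using System.Buffers.Binary;

public static class HeaderParsing
{
    // Hypothetical layout for illustration only:
    // a 4-byte big-endian tag followed by an 8-byte big-endian value.
    public static (uint Tag, ulong Value)? TryReadHeader(ReadOnlySpan<byte> bytes)
    {
        // Validate the total length once, up front.
        if (bytes.Length < 4 + 8)
            return null;

        // These reads are now guaranteed to stay within bounds.
        uint tag = BinaryPrimitives.ReadUInt32BigEndian(bytes);
        ulong value = BinaryPrimitives.ReadUInt64BigEndian(bytes.Slice(4));

        return (tag, value);
    }
}
```

A caller treats `null` as "not enough data" instead of catching an exception partway through the parse.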

return new StringValue(bytes, encoding);
}

public byte[] GetNullTerminatedBytes(int maxLengthBytes, bool moveToMaxLength = false)
Collaborator Author

We may be able to return a ReadOnlySpan<byte> here instead.

Collaborator

Possibly, with the consequence that StringValue will have to change.

Collaborator Author

We'd still need to use ToArray when an array is actually needed, but there are a few cases where we could use a span directly.

For instance, here:

var bytes = GetNullTerminatedBytes(index, maxLengthBytes);

return (encoding ?? Encoding.UTF8).GetString(bytes);

I did some exploration on whether to change StringValue to accept ReadOnlyMemory, which could eliminate some allocations at the risk of rooting larger byte[] segments. I'm not sure if it's worth the trade-off.
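
As a rough sketch of the span-returning idea discussed here: the helper below slices the source at the first null byte and decodes straight from that slice, so no intermediate `byte[]` is allocated. The `NullTerminated` class and both method names are hypothetical and not part of the library. It also assumes a target where `Encoding.GetString(ReadOnlySpan<byte>)` is available (.NET Core 2.1 or later, or .NET Standard 2.1), which may not hold for every target the library supports.

```csharp
#nullable enable
using System;
using System.Text;

public static class NullTerminated
{
    // Hypothetical helper: returns the bytes before the first null terminator,
    // looking at no more than maxLengthBytes, without copying anything.
    public static ReadOnlySpan<byte> SliceAtNull(ReadOnlySpan<byte> source, int maxLengthBytes)
    {
        var window = source.Slice(0, Math.Min(maxLengthBytes, source.Length));
        int terminator = window.IndexOf((byte)0);

        return terminator < 0 ? window : window.Slice(0, terminator);
    }

    // Decodes directly from the slice; the only allocation is the resulting string.
    public static string ReadString(ReadOnlySpan<byte> source, int maxLengthBytes, Encoding? encoding = null)
    {
        var bytes = SliceAtNull(source, maxLengthBytes);

        return (encoding ?? Encoding.UTF8).GetString(bytes);
    }
}
```

Because a span is only a view over the source buffer, it cannot be stored in a type like StringValue, which is where the ReadOnlyMemory trade-off mentioned above comes in.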

Owner

Longer term I can imagine a lot more use of ReadOnlyMemory<byte> and/or ReadOnlySequence<byte> throughout the library.

@drewnoakes (Owner) left a comment

Looks great. Just one minor regression to fix up and a few comments and this can go in.

Comment on lines -245 to +246
- var reader = new SequentialByteArrayReader(bytes);
+ var reader = new BufferReader(bytes, isBigEndian: true);
  var keywordStringValue = reader.GetNullTerminatedStringValue(maxLengthBytes: 79);
- var keyword = keywordStringValue.ToString(_utf8Encoding);
+ var keyword = keywordStringValue.ToString(Encoding.UTF8);
Owner

Here's another example where GetNullTerminated... could return a span.

(Not advocating one way or another, just gathering data.)

Comment on lines 151 to 181
// The number of non-null bytes
int length;

byte[] buffer;

if (moveToMaxLength)
{
    buffer = GetBytes(maxLengthBytes);
    length = Array.IndexOf(buffer, (byte)'\0') switch
    {
        -1 => maxLengthBytes,
        int i => i
    };
}
else
{
    buffer = new byte[maxLengthBytes];
    length = 0;

    while (length < buffer.Length && (buffer[length] = GetByte()) != 0)
        length++;
}

if (length == 0)
    return [];
if (length == maxLengthBytes)
    return buffer;
var bytes = new byte[length];
if (length > 0)
    Array.Copy(buffer, bytes, length);
return bytes;
Owner

This now lives in two places. I wonder if we can avoid that duplication somehow. Nothing jumps out at me though.

Collaborator Author

There are a few allocations here that can be eliminated as well. This is on my list to refactor in another PR (along with returning a ReadOnlySpan). I'll see whether we can share any of that logic with the other implementation when I do.
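
One possible way to share part of the duplicated logic, sketched under assumptions: the tail of the method, which decides whether to return an empty array, the original buffer or a trimmed copy, does not depend on how the buffer was filled, so both readers could call one helper for it. The `NullTerminatedBuffers` class and the `TrimToLength` name are hypothetical and do not appear in the PR.

```csharp
using System;

public static class NullTerminatedBuffers
{
    // Hypothetical shared helper: given a filled buffer and the count of
    // non-null bytes at its start, returns an array of exactly that length.
    public static byte[] TrimToLength(byte[] buffer, int length)
    {
        if (length == 0)
            return Array.Empty<byte>();

        // The buffer is already the right size, so return it without copying.
        if (length == buffer.Length)
            return buffer;

        // Otherwise copy only the leading non-null bytes.
        return buffer.AsSpan(0, length).ToArray();
    }
}
```

This keeps the existing array-returning behaviour and would leave the two fill strategies (bulk read versus byte-by-byte) in their respective readers.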

@iamcarbon (Collaborator Author)

After this PR, we'll be able to spanify ITypeChecker. Fun seeing these build on each other!

@iamcarbon changed the title from "[Draft] Introduce and use Indexed BufferReader methods" to "Introduce and use Indexed BufferReader methods" on Feb 6, 2024
@drewnoakes (Owner)

> After this PR, we'll be able to spanify ITypeChecker. Fun seeing these build on each other!

Nice! I looked at that on the weekend and from memory it was blocked on the TGA checker that needed indexed access. Great that we can spanify it too.

I've been looking at trying to consolidate the THREE different approaches we have for reading ISO BMFF data. We have quite similar code for MP4, QuickTime and HEIF data formats. They're all different enough that it's not so straightforward, but I'm hopeful the end result will be good.

@drewnoakes merged commit f48737f into drewnoakes:main on Feb 6, 2024 (2 checks passed)
@drewnoakes (Owner)

I ran a trace over the regression suite, using .NET 8 in release mode, with only the .NET runner, which reads all the input files and writes out metadata text files.

The run used 9.92 seconds of CPU. We spent 135 ms in the GC, which is < 2%. Pretty good!

Interestingly, there are 22 gen 2 collections. Some of these are for the LOH (large object heap), which would be good to investigate, as ideally we wouldn't allocate anything on the LOH. I suspect they're mostly byte arrays, but I will investigate.
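
For context on why byte arrays are a likely suspect, here is a small sketch, assuming the default .NET GC settings: objects of 85,000 bytes or more are allocated on the large object heap, and the runtime reports them as belonging to the oldest generation. The `LohDemo` class is hypothetical and exists only to demonstrate this.

```csharp
using System;

public static class LohDemo
{
    // Default large object heap threshold; an assumption that holds unless
    // the GC has been configured with a different threshold.
    public const int LargeObjectThresholdBytes = 85_000;

    // Large object heap allocations are reported as the oldest generation.
    public static bool IsReportedAsOldestGeneration(byte[] buffer)
        => GC.GetGeneration(buffer) == GC.MaxGeneration;
}
```

Calling `LohDemo.IsReportedAsOldestGeneration(new byte[LohDemo.LargeObjectThresholdBytes])` should return true under those default settings, so any single read that allocates a buffer of that size or larger would show up in this part of the trace.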

I also see some finalizers running for reflection emit, which I find surprising. I'll try and track those down.
