Introduce and use Indexed BufferReader methods #402

iamcarbon · 2024-02-05T19:40:13Z

This PR introduces indexed accessors on BufferReader and eliminates another batch of allocations.

iamcarbon · 2024-02-05T23:54:27Z

MetadataExtractor/Formats/Exif/makernotes/PanasonicMakernoteDescriptor.cs

+            if (values.Length < 2 + 2)
            {
                return null;
            }


The length is validated upfront, so we don't have to worry about an IOException below.
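For readers following along, here is a minimal sketch of the "validate once, then read freely" pattern. `LengthCheckSketch` and `DescribePair` are hypothetical names, and the big-endian byte order and the output format are assumptions for illustration; this is not the actual descriptor code.

```csharp
#nullable enable
using System;
using System.Buffers.Binary;

static class LengthCheckSketch
{
    // Hypothetical helper: formats two 16-bit values, or returns null
    // when the input is too short to contain both of them.
    public static string? DescribePair(ReadOnlySpan<byte> values)
    {
        // Single upfront check: two 2-byte fields are needed.
        if (values.Length < 2 + 2)
            return null;

        // Both reads are now known to be in range, so neither can throw.
        // Big-endian is an assumption made for this sketch.
        ushort first = BinaryPrimitives.ReadUInt16BigEndian(values);
        ushort second = BinaryPrimitives.ReadUInt16BigEndian(values.Slice(2));

        return $"{first} {second}";
    }
}
```

With the check hoisted to the top, the body needs no try/catch and short input fails in one obvious place.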

iamcarbon · 2024-02-05T23:54:55Z

MetadataExtractor/Formats/Photoshop/PhotoshopDescriptor.cs

+            if (bytes.Length < 4 + 8)
                return null;


The length is validated upfront, to prevent an IOException later.

iamcarbon · 2024-02-05T23:55:55Z

MetadataExtractor/IO/BufferReader.cs

+        return new StringValue(bytes, encoding);
+    }
+
+    public byte[] GetNullTerminatedBytes(int maxLengthBytes, bool moveToMaxLength = false)


We may be able to return a ReadOnlySpan here instead.

Possibly, with the consequence that StringValue will have to change.

We'd still need to call ToArray when an array is actually needed, but there are a few cases where we could use a span directly.

For instance, here:

    var bytes = GetNullTerminatedBytes(index, maxLengthBytes);
    return (encoding ?? Encoding.UTF8).GetString(bytes);
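A rough sketch of what the span version of that call site might look like. `NullTerminatedSketch` and `GetNullTerminatedString` are hypothetical and are not part of BufferReader. It also assumes a target where `Encoding.GetString(ReadOnlySpan<byte>)` is available (.NET Core 2.1+ or .NET Standard 2.1); older targets would need a different overload.

```csharp
#nullable enable
using System;
using System.Text;

static class NullTerminatedSketch
{
    // Hypothetical helper: decodes a null-terminated string starting at
    // 'index' without allocating an intermediate byte[].
    public static string GetNullTerminatedString(
        ReadOnlySpan<byte> data, int index, int maxLengthBytes, Encoding? encoding = null)
    {
        // Look at no more than maxLengthBytes, and never past the end of the data.
        ReadOnlySpan<byte> window = data.Slice(index, Math.Min(maxLengthBytes, data.Length - index));

        // Cut the window at the first null byte, if there is one.
        int terminator = window.IndexOf((byte)0);
        if (terminator >= 0)
            window = window.Slice(0, terminator);

        // Decode straight from the span; the only allocation is the string.
        return (encoding ?? Encoding.UTF8).GetString(window);
    }
}
```

The array would only be materialised (via ToArray) in the places that genuinely need one, such as constructing a StringValue as it exists today.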

I did some exploration on whether to change StringValue to accept ReadOnlyMemory, which could eliminate some allocations at the risk of rooting larger byte[] segments. I'm not sure if it's worth the trade-off.

Longer term I can imagine a lot more use of ReadOnlyMemory<byte> and/or ReadOnlySequence<byte> throughout the library.

drewnoakes

Looks great. Just one minor regression to fix up and a few comments and this can go in.

drewnoakes · 2024-02-06T01:12:22Z

MetadataExtractor/IO/BufferReader.cs

+        return new StringValue(bytes, encoding);
+    }
+
+    public byte[] GetNullTerminatedBytes(int maxLengthBytes, bool moveToMaxLength = false)


Longer term I can imagine a lot more use of ReadOnlyMemory<byte> and/or ReadOnlySequence<byte> throughout the library.

drewnoakes · 2024-02-06T01:17:57Z

MetadataExtractor/Formats/Png/PngMetadataReader.cs

-                var reader = new SequentialByteArrayReader(bytes);
+                var reader = new BufferReader(bytes, isBigEndian: true);
                var keywordStringValue = reader.GetNullTerminatedStringValue(maxLengthBytes: 79);
-                var keyword = keywordStringValue.ToString(_utf8Encoding);
+                var keyword = keywordStringValue.ToString(Encoding.UTF8);


Here's another example where GetNullTerminated... could return a span.

(Not advocating one way or another, just gathering data.)

drewnoakes · 2024-02-06T02:04:46Z

MetadataExtractor/IO/BufferReader.cs

+        // The number of non-null bytes
+        int length;
+
+        byte[] buffer;
+
+        if (moveToMaxLength)
+        {
+            buffer = GetBytes(maxLengthBytes);
+            length = Array.IndexOf(buffer, (byte)'\0') switch
+            {
+                -1 => maxLengthBytes,
+                int i => i
+            };
+        }
+        else
+        {
+            buffer = new byte[maxLengthBytes];
+            length = 0;
+
+            while (length < buffer.Length && (buffer[length] = GetByte()) != 0)
+                length++;
+        }
+
+        if (length == 0)
+            return [];
+        if (length == maxLengthBytes)
+            return buffer;
+        var bytes = new byte[length];
+        if (length > 0)
+            Array.Copy(buffer, bytes, length);
+        return bytes;


This now lives in two places. I wonder if we can avoid that duplication somehow. Nothing jumps out at me though.

There are a few allocations here that can be eliminated as well. That's on my list to refactor in another PR (along with returning a ReadOnlySpan). I'll see if we can share any of that logic with the other implementation when I do.
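As a hedged illustration of the kind of saving being discussed, the tail of the method could be written against a span so that at most one array is allocated. `NullTerminatedBytesSketch` and `CopyUpToNull` are hypothetical names. The sketch does not model the reader's position, and it clamps to the available data instead of throwing when fewer than maxLengthBytes remain, so it is not a drop-in replacement.

```csharp
#nullable enable
using System;

static class NullTerminatedBytesSketch
{
    // Hypothetical helper: returns the bytes before the first null
    // (or the first maxLengthBytes bytes if no null is found).
    public static byte[] CopyUpToNull(ReadOnlySpan<byte> source, int maxLengthBytes)
    {
        // Restrict the search to the permitted window.
        ReadOnlySpan<byte> window = source.Slice(0, Math.Min(maxLengthBytes, source.Length));

        // Span search replaces both Array.IndexOf and the byte-by-byte loop.
        int terminator = window.IndexOf((byte)0);
        int length = terminator < 0 ? window.Length : terminator;

        // One allocation of exactly the right size, or none for empty input.
        return length == 0 ? Array.Empty<byte>() : window.Slice(0, length).ToArray();
    }
}
```

If both the sequential and indexed paths can produce a span over the underlying bytes, a helper along these lines might be one place where the duplicated logic could meet.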

iamcarbon · 2024-02-06T03:41:56Z

After this PR, we'll be able to spanify ITypeChecker. Fun seeing these build on each other!

drewnoakes · 2024-02-06T10:31:52Z

> After this PR, we'll be able to spanify ITypeChecker. Fun seeing these build on each other!

Nice! I looked at that on the weekend and from memory it was blocked on the TGA checker that needed indexed access. Great that we can spanify it too.

I've been looking at trying to consolidate the THREE different approaches we have for reading ISO BMFF data. We have quite similar code for MP4, QuickTime and HEIF data formats. They're all different enough that it's not so straightforward, but I'm hopeful the end result will be good.

drewnoakes · 2024-02-06T13:38:38Z

I ran a trace over the regression suite, using .NET 8 in release mode, with only the .NET runner, which reads all the input files and writes out metadata text files.

The run used 9.92 seconds of CPU. We spent 135 ms in the GC, which is < 2%. Pretty good!

Interestingly, there are 22 gen 2 collections. Some of these involve the LOH (large object heap), which would be good to investigate, as ideally we wouldn't be allocating anything on the LOH. I suspect they're mostly byte arrays, but will investigate.

I also see some finalizers running for reflection emit, which I find surprising. I'll try and track those down.

iamcarbon added 4 commits February 5, 2024 11:03

Add indexed methods to BufferReader

421a818

Use BufferReader to reduce allocations

66b86e1

Extend Buffer reader with additional indexed methods

f9d439c

Use BufferReader to reduce allocations (2/n)

de38b42

iamcarbon commented Feb 5, 2024

drewnoakes reviewed Feb 6, 2024

iamcarbon added 5 commits February 5, 2024 19:13

Fix overlapping read in PhotoshopDescriptor

e387ab2

Simplify BufferReader.GetInt32() call

a18b7e0

Fix endianess in IccDescriptor

ba76c9f

Break out indexed BufferReader logic into partial

695051f

Use .net8.0 BinaryPrimitives methods

8adb228

iamcarbon changed the title ~~[Draft] Introduce and use Indexed BufferReader methods~~ Introduce and use Indexed BufferReader methods Feb 6, 2024

drewnoakes added 3 commits February 6, 2024 21:19

Make methods readonly

5222321

Move sequential members to another part

151b5e3

Add some API docs

bec7461

drewnoakes merged commit f48737f into drewnoakes:main Feb 6, 2024
2 checks passed
