Improve Code Quality / Tiding #390

iamcarbon · 2024-01-29T18:37:49Z

Simplifies string.Join calls (eliminates various array allocations)
Uses new GetString polyfill
Removes Empty class (using empty collection expression instead)
Eliminates various byte[] allocations
Updates ShouldAcceptList to use UTF-8 bytes
Updates StartsWithJpegExifPreamble to accept ReadOnlySpan<byte>
Switches on integer values to avoid a string allocation

iamcarbon · 2024-01-29T18:38:17Z

@drewnoakes Ready for review. This is the last of the changes proposed for 2.9.

drewnoakes

Looking good. Thanks for fixing all these. Left some comments. See what you think and we'll get this merged.


drewnoakes · 2024-01-30T03:40:10Z

MetadataExtractor/Formats/Iptc/Iso2022Converter.cs

@@ -89,17 +91,24 @@ public static class Iso2022Converter

            foreach (var encoding in encodings)
            {
+                char[] charBuffer = ArrayPool<char>.Shared.Rent(encoding.GetMaxCharCount(bytes.Length));


Nice. There are some other places we could use ArrayPool too actually, like in the reader classes. I'll investigate that separately for 2.9.0, unless you'd like to.

iamcarbon · 2024-01-30

There are also some wins in introducing a few new specialized readers that can operate directly over ReadOnlySpan<byte>.

One callout would be replacing SequentialByteArrayReader with an optimized ref struct {LittleEndian/BigEndian}BufferReader(ReadOnlySpan<byte> buffer), and spanifying the outer method.

This would allow us to operate directly over a span, and eliminate the reader allocation.
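
To make the ref struct idea concrete, here is a minimal sketch of what a big-endian variant might look like. The type name, members, and layout are assumptions for illustration only; nothing here is taken from MetadataExtractor, and a real replacement for SequentialByteArrayReader would need many more accessors.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical sketch, not a MetadataExtractor type.
// A ref struct lives on the stack, so constructing one allocates nothing on the heap.
public ref struct BigEndianBufferReader
{
    private readonly ReadOnlySpan<byte> _buffer;
    private int _position;

    public BigEndianBufferReader(ReadOnlySpan<byte> buffer)
    {
        _buffer = buffer;
        _position = 0;
    }

    public int Position => _position;

    public int Remaining => _buffer.Length - _position;

    public byte GetByte() => _buffer[_position++];

    public ushort GetUInt16()
    {
        // Slice throws ArgumentOutOfRangeException if fewer than 2 bytes remain.
        ushort value = BinaryPrimitives.ReadUInt16BigEndian(_buffer.Slice(_position, 2));
        _position += 2;
        return value;
    }

    public uint GetUInt32()
    {
        uint value = BinaryPrimitives.ReadUInt32BigEndian(_buffer.Slice(_position, 4));
        _position += 4;
        return value;
    }

    // Returns a view over the underlying buffer rather than a copy.
    public ReadOnlySpan<byte> GetBytes(int count)
    {
        ReadOnlySpan<byte> slice = _buffer.Slice(_position, count);
        _position += count;
        return slice;
    }
}
```

A caller that already holds a span (for example a slice of a segment) could construct the reader directly over it and read fields in sequence. The trade-off is that a ref struct cannot be stored in a field of a class or captured by a lambda, which is why the outer method would need to be spanified too.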

There's another big win eliminating all the temporary array allocations when we make (int tagName, params string[] descriptions) calls.

drewnoakes · 2024-01-31

Some more use of ArrayPool in #392.

There's definitely room for improvement in the reader classes. Check out the PRs from @kwhopper for some more ideas.

My vague idea here is to build on the new span types and @kwhopper's investigations, and actually avoid doing as much parsing work during the extraction phase. Many directories we store could actually be backed by a byte[] (or Memory<byte>) that could be inspected when enumerating through tags. This would be quite a big change, and requires some research before it could be pursued.

> There's another big win eliminating all the temporary array allocations when we make (int tagName, params string[] descriptions) calls.

This sounds promising. We'd need to verify that the compiler doesn't allocate an array behind the scenes.

EDIT: It seems to do the right thing on modern .NET: https://sharplab.io/#v2:EYLgtghglgdgPgAQEwEYCwAoBBmABAlANnyVwGFcBvTXW/PA4hAFlwFkUAKAJQFMIAJgHkYAGwCeAZQAOEGAB4CABgB8+FEoDOAShp1qGOrgC+e2mfrqmrNkk67D+i0Y6cA2gCIAgh4A0uDwAhPwCyDwBdbQBuC1MMYyA===

...but not on .NET Framework: https://sharplab.io/#v2:EYLgHgbALAPgAgJgIwFgBQcDMACOSK4LYDC2A3utlbjngXFNgLJIAUASgKYCGAJgPIA7ADYBPAMoAHboIA8eAAwA+XEgUBnAJSVqFNNWwBfHVRM1V9RkwStt+3WYMtWAbQBEAQTcAabG4BCPn7EbgC6mgDcZsZohkA==

This is a compiler feature, so we'd need to test netstandard2.1 csc output, which sharplab doesn't support afaik.
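
For readers following along, this is a rough sketch of the kind of change being discussed: offering a ReadOnlySpan<string> overload next to a params string[] one. The class and method names are hypothetical, not the library's API. Whether the span overload actually avoids a heap allocation at the call site depends on the compiler and target runtime, which is exactly the open question above.

```csharp
#nullable enable
using System;

// Hypothetical helper, for illustration only.
public static class DescriptionLookup
{
    // params form: the compiler builds a string[] for the argument list at each call site.
    public static string? FromParams(int index, params string[] descriptions)
        => (uint)index < (uint)descriptions.Length ? descriptions[index] : null;

    // Span form: the caller decides how the backing storage is created.
    public static string? FromSpan(int index, ReadOnlySpan<string> descriptions)
        => (uint)index < (uint)descriptions.Length ? descriptions[index] : null;
}
```

With C# 12, a call such as `DescriptionLookup.FromSpan(1, ["Off", "On"])` passes a collection expression to the span overload. Both overloads return null for an out-of-range index.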

kwhopper · 2024-01-31

> There's definitely room for improvement in the reader classes. Check out the PRs from @kwhopper for some more ideas.

Thanks. RandomAccessStream+ReaderInfo is an experiment somewhat similar to this Span conversion. It goes a bit further by abstracting away all the buffering (RandomAccessStream) and "span-ifying" (ReaderInfo with byte arrays) entirely; callers then only have to worry about one kind of reader. Side effects are the ability to know your exact physical offset at any time, and support for streamed content.

If you can reach those same goals in this process, that's a great addition and should allow new things in the future. I can also check through the code in those old PRs to see if they could use Spans like you're doing here, if that has some value.

drewnoakes · 2024-02-01

@kwhopper the recent activity has been small incremental improvements. I see your PR as the kind of thing we want for 3.0. It'll be a lot of work to integrate throughout the code though, so we should noodle out the details with sketches and discussion before starting the work of integration. We only want to do that integration work once.

A direction I think would be good is to divide the parsing into two stages:

1. Reading the file in coarse chunks. E.g. for JPEG, this could be just pulling out and labelling the segments we need. These would be allocated in contiguous chunks of memory, with no further processing. I think this phase could mostly be done sequentially. The chunks would remember their offsets relative to the start of the file too, along with whatever metadata is needed for later steps.

2. As the consumer walks through the metadata, we process the chunks of data to produce the tags.

Currently we do both 1 and 2 during the read phase. I'm thinking that, with this, we'd just do step 1 during that phase, and step 2 during the enumeration. This will mean a lot less work and fewer allocations during the first phase, and when that work's done during the second phase, any allocations would be shorter-lived and therefore more likely to be GC'd quickly in gen0. It'd also allow consumers to skip decoding bits they don't actually care about.
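
A toy sketch of the two-stage split, to anchor the discussion. Every name here (RawChunk, ChunkExtractor, LazyDirectory) and the 3-byte entry layout are invented for illustration; no real file format or existing MetadataExtractor type is implied.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Stage 1 output: a labelled slice of the file that remembers where it came from.
public sealed record RawChunk(string Label, long FileOffset, ReadOnlyMemory<byte> Data);

public static class ChunkExtractor
{
    // Stage 1: record the chunk without decoding it. Slicing does not copy bytes.
    public static RawChunk Slice(ReadOnlyMemory<byte> file, string label, int offset, int length)
        => new RawChunk(label, offset, file.Slice(offset, length));
}

public sealed class LazyDirectory
{
    // Assumed toy layout: repeated [tag: 1 byte][value: 2 bytes, big-endian].
    private const int EntrySize = 3;

    private readonly RawChunk _chunk;

    public LazyDirectory(RawChunk chunk) => _chunk = chunk;

    // Stage 2: tags are decoded only as the consumer enumerates them.
    public IEnumerable<(int Tag, ushort Value)> EnumerateTags()
    {
        for (int i = 0; i + EntrySize <= _chunk.Data.Length; i += EntrySize)
            yield return DecodeEntry(i);
    }

    // Span work is kept in a separate non-iterator method.
    private (int Tag, ushort Value) DecodeEntry(int offset)
    {
        ReadOnlySpan<byte> entry = _chunk.Data.Span.Slice(offset, EntrySize);
        return (entry[0], BinaryPrimitives.ReadUInt16BigEndian(entry.Slice(1)));
    }
}
```

A consumer that stops enumerating early, or never enumerates a given directory, pays nothing for the entries it skips, which is the "skip decoding bits they don't care about" property described above.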

I'm hoping to write this up a bit more comprehensively and would really appreciate your input.

drewnoakes · 2024-01-30T03:45:37Z

MetadataExtractor/TagDescriptor.cs

+                return Encoding.UTF8.GetString(values)
                    .Trim('\0', ' ', '\r', '\n', '\t');


I was wondering if we could trim a span rather than a string, but I don't think it's safe to trim these bytes in all UTF8 strings, and that trimming characters is better. It would be possible to use the Encoding to populate a Span<char> and trim that, but I don't think it's worth it.

My plan is to look at some traces and take a data-led approach to the next wave of optimisations. There's too much code to go through.

iamcarbon · 2024-01-30

This is also a good case for ArrayPool: we could trim the char[] buffer before materializing the string. Agreed that we need to be careful operating on bytes when the string might have multi-byte code points.

We should also try to eliminate some of these silently allocated arrays when using functions that accept a params T[] array (as is the case above).
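
A sketch of the ArrayPool idea from this thread: decode into a rented char[], trim the decoded characters (not the raw bytes), and only then create the string. The class and method names are hypothetical, and whether this is worth the extra code was left open above.

```csharp
using System;
using System.Buffers;
using System.Text;

// Hypothetical helper, for illustration only.
public static class TrimmedDecoder
{
    // Same characters that the existing Trim call removes.
    private static readonly char[] TrimChars = { '\0', ' ', '\r', '\n', '\t' };

    public static string GetTrimmedString(byte[] bytes, Encoding encoding)
    {
        // Rent may return a larger array than requested, so track the decoded length.
        char[] buffer = ArrayPool<char>.Shared.Rent(encoding.GetMaxCharCount(bytes.Length));
        try
        {
            int charCount = encoding.GetChars(bytes, 0, bytes.Length, buffer, 0);
            ReadOnlySpan<char> chars = buffer.AsSpan(0, charCount);

            // Trimming happens on decoded chars, so multi-byte sequences are never split.
            return chars.Trim(TrimChars).ToString();
        }
        finally
        {
            ArrayPool<char>.Shared.Return(buffer);
        }
    }
}
```

Compared with `GetString(...).Trim(...)`, this creates one string instead of up to two when trimming removes characters, at the cost of a rent/return pair.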

drewnoakes

Amazing, cheers!

drewnoakes · 2024-01-31T11:10:37Z

MetadataExtractor/Formats/Exif/makernotes/OlympusCameraSettingsMakernoteDescriptor.cs

@@ -651,15 +651,15 @@ public sealed class OlympusCameraSettingsMakernoteDescriptor(OlympusCameraSettin
            if (Directory.GetObject(OlympusCameraSettingsMakernoteDirectory.TagGradation) is not short[] values || values.Length < 3)
                return null;

-            var join = $"{values[0]} {values[1]} {values[2]}";
-            var ret = join switch
+            var ret = (values[0], values[1], values[3]) switch


Oops, this should have been a 2 instead of a 3. I'll push a fix shortly.
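
For reference, a standalone sketch of the tuple-switch pattern with the corrected index. The guard only ensures three elements, so index 3 can throw IndexOutOfRangeException when the array has exactly three values. The class name and the specific value-to-label mapping below are illustrative assumptions, not copied from the descriptor.

```csharp
#nullable enable
using System;

// Hypothetical example, for illustration only.
public static class GradationExample
{
    public static string? Describe(short[]? values)
    {
        if (values is null || values.Length < 3)
            return null;

        // Switching on a tuple of integers avoids building a joined string just to compare it.
        return (values[0], values[1], values[2]) switch
        {
            (-1, -1, 1) => "Low Key",
            (0, -1, 1) => "Normal",
            (1, -1, 1) => "High Key",
            _ => $"Unknown ({values[0]} {values[1]} {values[2]})"
        };
    }
}
```

Only the fallback arm formats a string, so the common cases return constant strings without any allocation.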

drewnoakes · 2024-01-31T11:29:10Z

MetadataExtractor/Formats/QuickTime/QuickTimeReaderExtensions.cs

-            var sb = new StringBuilder(4);
-            sb.Append((char)reader.GetByte());
-            sb.Append((char)reader.GetByte());
-            sb.Append((char)reader.GetByte());
-            sb.Append((char)reader.GetByte());
-            return sb.ToString();
+            Span<byte> bytes = stackalloc byte[4];
+
+            reader.GetBytes(bytes);
+
+            return Encoding.ASCII.GetString(bytes);


This causes a behaviour change for some inputs, though I've only seen it on fuzzed files that contain very weird data:

The ASCII encoding replaces some characters with ? which potentially loses information. According to https://en.wikipedia.org/wiki/FourCC non-printable characters are valid. I'm not sure it's a problem in practice, but I think I'll make a change here to restore the old behaviour.
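
One possible way to keep the old byte-to-char mapping without going back to StringBuilder is sketched below. This is an assumption about what a fix could look like, not the change that was actually made, and the class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class FourCc
{
    // Maps each byte to the char with the same code (0x00 to 0xFF), as the old
    // (char)reader.GetByte() code did, so bytes above 0x7F are not replaced with '?'.
    public static string ToFourCcString(ReadOnlySpan<byte> bytes)
    {
        if (bytes.Length != 4)
            throw new ArgumentException("A FourCC is exactly four bytes.", nameof(bytes));

        Span<char> chars = stackalloc char[4];

        for (int i = 0; i < chars.Length; i++)
            chars[i] = (char)bytes[i];

        // Span<char>.ToString() creates a string from the characters in the span.
        return chars.ToString();
    }
}
```

For example, the bytes 0xA9 0x6E 0x61 0x6D come back as "©nam" with this mapping, whereas Encoding.ASCII.GetString yields "?nam" for the same input.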

iamcarbon added 10 commits January 29, 2024 09:29

Simplify string.Join calls

a8a7592

Fix formatting

99e3fae

Use new GetString polyfill

03a8a06

Remove Empty class

731ada7

Eliminate byte[] allocation in QuickTimeMetadataReader

ad1e954

Eliminate allocation in Get4ccString

3fcef87

Update ShouldAcceptList to accept utf8 bytes

1ddd627

Update StartsWithJpegExifPreamble to accept ReadOnlySpan<byte>

0443c54

Use switch expression

7c5a6ca

Switch on integer values to avoid string allocation

b229f63

drewnoakes reviewed Jan 30, 2024


iamcarbon added 6 commits January 29, 2024 18:51

Use switch statement in OlympusFocusInfoMakernoteDescriptor.GetExternalFlashZoomDescription

25d2fea

Use pattern matching to simplify another string.Join condition

f8ef4c4

Simplify GetString call

76f35fa

Use Encoding.ASCII.GetString

48f9cba

Eliminate byte[] allocation in RiffReader

446e295

Eliminate string allocation in Iso2022Converter

21f5e47

drewnoakes reviewed Jan 30, 2024


drewnoakes reviewed Jan 30, 2024


drewnoakes approved these changes Jan 30, 2024


drewnoakes merged commit 06b0bd0 into drewnoakes:main Jan 30, 2024
2 checks passed

drewnoakes reviewed Jan 31, 2024
