
Huge allocations #283

Open
erri120 opened this issue Apr 2, 2021 · 2 comments

Comments


erri120 commented Apr 2, 2021

I'm currently in the process of figuring out which library to use for getting metadata from media files, and this library is definitely one of the fastest around. The only problem I have is the huge allocations it makes:

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-7700K CPU 4.20GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.201
  [Host]     : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT
  DefaultJob : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT
```

| Method | Folder | Mean | Error | StdDev | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------|--------|-----:|------:|-------:|----:|----:|------:|------:|------:|----------:|
| ParseWithMetadataExtractor | REDACTED | 563.2 ms | 23.79 ms | 15.74 ms | 536.6 ms | 580.6 ms | 44000.0000 | 22000.0000 | 10000.0000 | 230725.6 KB |
| ParseWithMediaInfo | REDACTED | 771.5 ms | 29.93 ms | 19.80 ms | 745.8 ms | 801.8 ms | - | - | - | 13.7 KB |

The folder I tested this on contained 144 files (136 videos and 8 images, 1 GB total), and I saw allocations of around 220 MB.

| Method | Folder | Mean | Error | StdDev | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------|--------|-----:|------:|-------:|----:|----:|------:|------:|------:|----------:|
| ParseWithMetadataExtractor | REDACTED | 140.9 ms | 9.84 ms | 6.51 ms | 128.5 ms | 147.1 ms | 18000.0000 | 14000.0000 | 8000.0000 | 65805.38 KB |
| ParseWithMediaInfo | REDACTED | 217.7 ms | 15.54 ms | 10.28 ms | 203.3 ms | 231.9 ms | - | - | - | 26.01 KB |

The next benchmark was run on a folder containing 276 files (images only), and again the allocations are far higher than they should be.

Using the Dynamic Program Analysis built into Rider, I found that most allocations happen because the library reads the entire contents of a section into a byte array and often processes those arrays later on, as with PNG and JPEG.

Possible improvements could be made by using Span&lt;T&gt; and Slice for the chunks, since slicing returns a ReadOnlySpan&lt;T&gt; with no extra allocations.
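To illustrate what I mean (a standalone sketch, not MetadataExtractor code): one backing array per section, with zero-allocation slice views handed out for parsing.

```csharp
using System;

class SpanSliceDemo
{
    static void Main()
    {
        // One backing allocation for the whole section (here, a PNG header)...
        byte[] section = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

        // ...then zero-allocation views into it. Slicing a ReadOnlySpan<byte>
        // creates no new arrays; it only adjusts the span's offset and length.
        ReadOnlySpan<byte> span = section;
        ReadOnlySpan<byte> signature = span.Slice(0, 4);

        Console.WriteLine(signature[1] == (byte)'P'); // prints True
    }
}
```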

Then there is also the concept of binary overlays, which differ from typical binary importing in that you do not read everything from the file into memory upfront and then parse it; instead you keep an open stream and parse only the bare minimum needed to know the file layout. With the layout known, you can expose getters backed by Lazy functions (or similar) which jump to the specific position in the file stream and parse the section on demand instead of up front. This method is extremely useful because the program only reads and parses what you actually need, so allocations are kept to a minimum. The biggest problem is that it requires a not-so-small amount of refactoring and API changes.
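A minimal sketch of the overlay idea (all names here are hypothetical, not library API): record where a section lives in the stream, and only read its bytes the first time a getter is called.

```csharp
using System;
using System.IO;

// Hypothetical sketch: a "binary overlay" records a section's position and
// length, and defers the actual read until the Data getter is first used.
// Requires a seekable stream.
class SectionOverlay
{
    private readonly Stream _stream;
    private readonly Lazy<byte[]> _data;

    public long Offset { get; }
    public int Length { get; }

    public SectionOverlay(Stream stream, long offset, int length)
    {
        _stream = stream;
        Offset = offset;
        Length = length;
        _data = new Lazy<byte[]>(ReadOnDemand);
    }

    // Jump to the recorded position and read only when asked.
    private byte[] ReadOnDemand()
    {
        var buffer = new byte[Length];
        _stream.Seek(Offset, SeekOrigin.Begin);
        int total = 0;
        while (total < Length)
        {
            int n = _stream.Read(buffer, total, Length - total);
            if (n == 0) throw new EndOfStreamException();
            total += n;
        }
        return buffer;
    }

    public byte[] Data => _data.Value; // parsed/read lazily, then cached
}
```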

kwhopper (Collaborator) commented Apr 2, 2021

Thanks @erri120

Some testing and implementation has been done through #250 to try to address (most of) the internal byte array issues. It uses a List structure to hold file byte chunks that should be small enough (< 85,000 bytes) to avoid LOH allocations, and also allows multiple reads against the same byte sequence instead of copying (which may account for a lot of the allocations you see).
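The chunking idea can be sketched roughly like this (illustrative only; #250's actual implementation differs). The .NET Large Object Heap threshold is 85,000 bytes, so each chunk stays below that:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative sketch: hold file bytes as a list of sub-85,000-byte chunks
// so no single array lands on the Large Object Heap, and random access
// re-reads the stored chunks instead of copying out new arrays.
class ChunkedBytes
{
    private const int ChunkSize = 81920; // safely below the 85,000-byte LOH cutoff
    private readonly List<byte[]> _chunks = new();

    public ChunkedBytes(Stream stream)
    {
        // Fill each chunk completely (except possibly the last) so the
        // indexer's chunk-size arithmetic below holds.
        while (true)
        {
            var chunk = new byte[ChunkSize];
            int total = 0, n;
            while (total < ChunkSize &&
                   (n = stream.Read(chunk, total, ChunkSize - total)) > 0)
                total += n;
            if (total == 0) break;
            if (total < ChunkSize)
                Array.Resize(ref chunk, total); // trim the final partial chunk
            _chunks.Add(chunk);
            if (total < ChunkSize) break;
        }
    }

    public byte this[long index] =>
        _chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)];
}
```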

It's still a WIP and quite a change to the library, and it may or may not address all allocation scenarios. Hopefully it becomes a small step forward, but we'll see how it plays out.

Keep in mind too that MetadataExtractor is designed to always read all possible metadata and create the internal representation. It wasn't designed to read up to a user-defined point or in a lazy manner. A few issues have been filed over the years requesting a lazy and/or truncated reading feature, but like you said, it would be quite a refactor. I think something along the lines of #250 would need to be in place first, to at the very least address all the byte array copying, before going back to these other ideas.

drewnoakes (Owner) commented

@erri120 thanks for your investigation and suggestions.

> With the layout you can then expose getters that call Lazy functions or similar which then jump to the specific position in the file stream and parse the section on demand instead of up front

In general the library doesn't assume that the stream is seekable. Some users process data live off a socket, for example, where seeking is not an option without buffering.

As @kwhopper calls out, he has a promising PR to unify the data access logic, which should open up some more opportunities here. However in the non-seek pattern, there's no guarantee (in general) that later data doesn't point back to earlier data. In such cases we need to buffer data that we skip, in case it's needed later on.

There are some cases where we expect to skip without needing to seek back to the data being skipped (e.g. image data). I believe that would be the most promising approach to cut down on these allocations in the non-seekable case.

In the seek case, it's still quite likely we need to buffer chunks of data due to random access patterns.

A promising option here would be to maintain a pool of these buffers so that they are reused, reducing GC churn.
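A sketch of what that pooling could look like using `ArrayPool<byte>` from `System.Buffers` (the delegate and method names are hypothetical, not library API):

```csharp
using System;
using System.Buffers;
using System.IO;

// Spans are ref structs and can't be generic type arguments (e.g. in
// Action<T>), so a dedicated delegate is needed to receive one.
delegate void SpanParser(ReadOnlySpan<byte> span);

// Sketch of buffer pooling: rent a buffer from the shared pool, read into
// it, hand the filled portion to the parser, and return the buffer so
// subsequent reads reuse the same memory instead of churning the GC.
static class PooledRead
{
    public static int ReadSegment(Stream stream, int length, SpanParser parse)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(length); // may be larger than length
        try
        {
            int read = stream.Read(buffer, 0, length);
            parse(buffer.AsSpan(0, read));
            return read;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```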

> Possible improvements could be made by using Span and .Slice for the chunks which returns a ReadOnlySpan with no extra allocations.

Span/ReadOnlySpan require data to be in contiguous memory, so it still needs to be copied from the underlying stream into a buffer. While MetadataExtractor could make better use of these newer language constructs in some places (note though that we still currently support as far back as .NET Framework 3.5), it is not a free lunch. Span is great for avoiding copies of data, but in our case we can't (in general) avoid copying from a stream to a buffer.

I'm not sure how the MediaInfo library you're comparing against works, but perhaps they're not pulling out as much metadata. If you know that you're only after a few specific types of metadata, you can configure a type (such as JpegMetadataReader) to only use specific JPEG reader implementations, which will cut down on the number of byte[] allocations for JPEG segments related to metadata you don't care about.
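For example, something along these lines (assuming the `JpegMetadataReader.ReadMetadata` overload that accepts a collection of segment readers; check the current API before relying on the exact signature):

```csharp
using System.Collections.Generic;
using MetadataExtractor;
using MetadataExtractor.Formats.Exif;
using MetadataExtractor.Formats.Jpeg;

static class ExifOnly
{
    // Pass only the segment readers you care about (here, just Exif) so
    // JPEG segments for other metadata types are skipped rather than
    // copied into byte[]s.
    public static IReadOnlyList<Directory> ReadExif(string path)
    {
        var readers = new IJpegSegmentMetadataReader[] { new ExifReader() };
        return JpegMetadataReader.ReadMetadata(path, readers);
    }
}
```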
