
Huge allocations #283

Open
erri120 opened this issue Apr 2, 2021 · 2 comments

Comments


erri120 commented Apr 2, 2021

I'm currently in the process of figuring out which library to use for getting metadata from media files, and this library is definitely one of the fastest around. The only problem I have is the huge allocations it makes:

```
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-7700K CPU 4.20GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.201
  [Host]     : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT
  DefaultJob : .NET Core 5.0.4 (CoreCLR 5.0.421.11614, CoreFX 5.0.421.11614), X64 RyuJIT
```

| Method | Folder | Mean | Error | StdDev | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------|--------|-----:|------:|-------:|----:|----:|------:|------:|------:|----------:|
| ParseWithMetadataExtractor | REDACTED | 563.2 ms | 23.79 ms | 15.74 ms | 536.6 ms | 580.6 ms | 44000.0000 | 22000.0000 | 10000.0000 | 230725.6 KB |
| ParseWithMediaInfo | REDACTED | 771.5 ms | 29.93 ms | 19.80 ms | 745.8 ms | 801.8 ms | - | - | - | 13.7 KB |

The folder I tested this on contained 144 files (136 videos and 8 images, 1 GB total), and I saw allocations of around 220 MB.

| Method | Folder | Mean | Error | StdDev | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------|--------|-----:|------:|-------:|----:|----:|------:|------:|------:|----------:|
| ParseWithMetadataExtractor | REDACTED | 140.9 ms | 9.84 ms | 6.51 ms | 128.5 ms | 147.1 ms | 18000.0000 | 14000.0000 | 8000.0000 | 65805.38 KB |
| ParseWithMediaInfo | REDACTED | 217.7 ms | 15.54 ms | 10.28 ms | 203.3 ms | 231.9 ms | - | - | - | 26.01 KB |

The next benchmark was run on a folder containing 276 files (images only), and again the allocations are far higher than they should be.

Using the Dynamic Program Analysis built into Rider, I found that most allocations happen because the library reads the entire contents of a section into a byte array and often processes those arrays later on, as with PNG and JPEG.

Possible improvements could be made by using Span&lt;T&gt; and Slice for the chunks, since slicing returns a ReadOnlySpan&lt;T&gt; with no extra allocations.
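To illustrate what I mean (a standalone sketch, not MetadataExtractor code): one backing array per section, with zero-allocation slice views handed out for parsing.

```csharp
using System;

class SpanSliceDemo
{
    static void Main()
    {
        // One backing allocation for the whole section (here, a PNG header)...
        byte[] section = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

        // ...then zero-allocation views into it. Slicing a ReadOnlySpan<byte>
        // creates no new arrays; it only adjusts the span's offset and length.
        ReadOnlySpan<byte> span = section;
        ReadOnlySpan<byte> signature = span.Slice(0, 4);

        Console.WriteLine(signature[1] == (byte)'P'); // prints True
    }
}
```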

Then there is also the concept of binary overlays, which differ from typical binary importing in that you do not read everything from the file into memory upfront and then parse it; instead you keep an open stream and parse only the bare minimum needed to know the file layout. With the layout known, you can expose getters backed by Lazy functions (or similar) which jump to the specific position in the file stream and parse the section on demand instead of up front. This method is extremely useful because the program only reads and parses what you actually need, so allocations are kept to a minimum. The biggest problem is that it requires a not-so-small amount of refactoring and API changes.
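A minimal sketch of the overlay idea (all names here are hypothetical, not library API): record where a section lives in the stream, and only read its bytes the first time a getter is called.

```csharp
using System;
using System.IO;

// Hypothetical sketch: a "binary overlay" records a section's position and
// length, and defers the actual read until the Data getter is first used.
// Requires a seekable stream.
class SectionOverlay
{
    private readonly Stream _stream;
    private readonly Lazy<byte[]> _data;

    public long Offset { get; }
    public int Length { get; }

    public SectionOverlay(Stream stream, long offset, int length)
    {
        _stream = stream;
        Offset = offset;
        Length = length;
        _data = new Lazy<byte[]>(ReadOnDemand);
    }

    // Jump to the recorded position and read only when asked.
    private byte[] ReadOnDemand()
    {
        var buffer = new byte[Length];
        _stream.Seek(Offset, SeekOrigin.Begin);
        int total = 0;
        while (total < Length)
        {
            int n = _stream.Read(buffer, total, Length - total);
            if (n == 0) throw new EndOfStreamException();
            total += n;
        }
        return buffer;
    }

    public byte[] Data => _data.Value; // parsed/read lazily, then cached
}
```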

kwhopper (Collaborator) commented Apr 2, 2021

Thanks @erri120

Some testing and implementation has been done through #250 to try to address (most of) the internal byte array issues. It uses a List structure to hold file byte chunks that should be small enough (< 85,000 bytes) to avoid LOH allocations, and also allows multiple reads against the same byte sequence instead of copying (which may account for a lot of the allocations you see).
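The chunking idea can be sketched roughly like this (illustrative only; #250's actual implementation differs). The .NET Large Object Heap threshold is 85,000 bytes, so each chunk stays below that:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative sketch: hold file bytes as a list of sub-85,000-byte chunks
// so no single array lands on the Large Object Heap, and random access
// re-reads the stored chunks instead of copying out new arrays.
class ChunkedBytes
{
    private const int ChunkSize = 81920; // safely below the 85,000-byte LOH cutoff
    private readonly List<byte[]> _chunks = new();

    public ChunkedBytes(Stream stream)
    {
        // Fill each chunk completely (except possibly the last) so the
        // indexer's chunk-size arithmetic below holds.
        while (true)
        {
            var chunk = new byte[ChunkSize];
            int total = 0, n;
            while (total < ChunkSize &&
                   (n = stream.Read(chunk, total, ChunkSize - total)) > 0)
                total += n;
            if (total == 0) break;
            if (total < ChunkSize)
                Array.Resize(ref chunk, total); // trim the final partial chunk
            _chunks.Add(chunk);
            if (total < ChunkSize) break;
        }
    }

    public byte this[long index] =>
        _chunks[(int)(index / ChunkSize)][(int)(index % ChunkSize)];
}
```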

It's still a WIP and quite a change to the library, and it may or may not address all allocation scenarios. Hopefully it becomes a small step forward, but we'll see how it plays out.

Keep in mind too that MetadataExtractor is designed to always read all possible metadata and create the internal representation. It wasn't designed to read up to a user-defined point or in a lazy manner. A few issues have been filed over the years requesting a lazy and/or truncated reading feature, but like you said, it would be quite a refactor. I think something along the lines of #250 would need to be in place first, to at the very least address all the byte array copying, before going back to these other ideas.

drewnoakes (Owner) commented

@erri120 thanks for your investigation and suggestions.

> With the layout you can then expose getters that call Lazy functions or similar which then jump to the specific position in the file stream and parse the section on demand instead of up front

In general the library doesn't assume that the stream is seekable. Some users process data live off a socket, for example, where seeking is not an option without buffering.

As @kwhopper calls out, he has a promising PR to unify the data access logic, which should open up some more opportunities here. However in the non-seek pattern, there's no guarantee (in general) that later data doesn't point back to earlier data. In such cases we need to buffer data that we skip, in case it's needed later on.

There are some cases where we expect to skip without needing to seek back to the data being skipped (e.g. image data). I believe that would be the most promising approach to cut down on these allocations in the non-seekable case.

In the seek case, it's still quite likely we need to buffer chunks of data due to random access patterns.

A promising option here would be to maintain a pool of these buffers so that they are reused, reducing GC churn.
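A sketch of what that pooling could look like using `ArrayPool<byte>` from `System.Buffers` (the delegate and method names are hypothetical, not library API):

```csharp
using System;
using System.Buffers;
using System.IO;

// Spans are ref structs and can't be generic type arguments (e.g. in
// Action<T>), so a dedicated delegate is needed to receive one.
delegate void SpanParser(ReadOnlySpan<byte> span);

// Sketch of buffer pooling: rent a buffer from the shared pool, read into
// it, hand the filled portion to the parser, and return the buffer so
// subsequent reads reuse the same memory instead of churning the GC.
static class PooledRead
{
    public static int ReadSegment(Stream stream, int length, SpanParser parse)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(length); // may be larger than length
        try
        {
            int read = stream.Read(buffer, 0, length);
            parse(buffer.AsSpan(0, read));
            return read;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```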

> Possible improvements could be made by using Span and .Slice for the chunks which returns a ReadOnlySpan with no extra allocations.

Span/ReadOnlySpan require data to be in contiguous memory, so it still needs to be copied from the underlying stream into a buffer. While MetadataExtractor could make better use of these newer language constructs in some places (note though that we still currently support as far back as .NET Framework 3.5), it is not a free lunch. Span is great for avoiding copies of data, but in our case we can't (in general) avoid copying from a stream to a buffer.

I'm not sure how the MediaInfo library you're comparing against works, but perhaps they're not pulling out as much metadata. If you know that you're only after a few specific types of metadata, you can configure a type (such as JpegMetadataReader) to only use specific JPEG reader implementations, which will cut down on the number of byte[] allocations for JPEG segments related to metadata you don't care about.
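For example, something along these lines (assuming the `JpegMetadataReader.ReadMetadata` overload that accepts a collection of segment readers; check the current API before relying on the exact signature):

```csharp
using System.Collections.Generic;
using MetadataExtractor;
using MetadataExtractor.Formats.Exif;
using MetadataExtractor.Formats.Jpeg;

static class ExifOnly
{
    // Pass only the segment readers you care about (here, just Exif) so
    // JPEG segments for other metadata types are skipped rather than
    // copied into byte[]s.
    public static IReadOnlyList<Directory> ReadExif(string path)
    {
        var readers = new IJpegSegmentMetadataReader[] { new ExifReader() };
        return JpegMetadataReader.ReadMetadata(path, readers);
    }
}
```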
